Abstract

New applications of data mining, such as in biology, bioinformatics, or 
sociology, are faced with large datasets structured as graphs. We intro- 
duce a novel class of tree-shaped patterns called tree queries, and present 
algorithms for mining tree queries and tree-query associations in a large 
data graph. Novel about our class of patterns is that they can contain 
constants, and can contain existential nodes which are not counted when 
determining the number of occurrences of the pattern in the data graph. 
Our algorithms have a number of provable optimality properties, which 
are based on the theory of conjunctive database queries. We propose a 
practical, database-oriented implementation in SQL, and show that the 
approach works in practice through experiments on data about food webs, 
protein interactions, and citation analysis. 

1 Introduction 

The problem of mining patterns in graph-structured data has received consid- 
erable attention in recent years, as it has many interesting applications in such 
diverse areas as biology, the life sciences, the World Wide Web, or social sci- 
ences. In the present work we introduce a novel class of patterns, called tree 
queries, and we present algorithms for mining these tree queries and tree-query 
associations in a large data graph. This article is based on two earlier conference 
papers [T7H2D].

Tree queries are powerful tree-shaped patterns, inspired by conjunctive database
queries [18]. In comparison to the kinds of patterns used in most other
graph mining approaches, tree queries have some extra features:

• Patterns may have "existential" nodes: any occurrence of the pattern must 
have a copy of such a node, but existential nodes are not counted when 
determining the number of occurrences. 

• Moreover, patterns may have "parameterized" nodes, labeled by con- 
stants, which must map to fixed designated nodes of the data graph. 

• An "occurrence" of the pattern in a data graph G is defined as any
homomorphism from the pattern into G. When counting the number of
occurrences, two occurrences that differ only on existential nodes are identified.

Past work in graph mining has dealt with node labels, but only with
non-unique ones: such labels are easily simulated by constants, but the converse
is not obvious. It is also possible to simulate edge labels using constants. To
simulate a node label a, add a special node a, and express that node x has label
a by drawing an edge from x to a. For an edge x → y labeled b, introduce an
intermediate node x.y with x → x.y → y, and label node x.y by b.

Figure 1: Simple examples of tree-query patterns
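The label simulation described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `simulate_labels`, the tagged tuples used for label nodes and intermediate nodes, and the restriction to at most one labeled edge per ordered pair of nodes are not taken from the paper.

```python
def simulate_labels(node_labels, labeled_edges):
    """Encode a labeled graph as a plain set of ordered pairs (edges).

    node_labels:   dict mapping node -> label
    labeled_edges: iterable of (x, label, y) triples
    Label nodes are tagged as ("label", a) and intermediate nodes as
    ("edge", x, y); both tags are hypothetical conventions chosen here so
    that the new nodes cannot clash with ordinary nodes.
    """
    edges = set()
    for x, a in node_labels.items():
        edges.add((x, ("label", a)))        # node x has label a
    for x, b, y in labeled_edges:
        mid = ("edge", x, y)                # the intermediate node "x.y"
        edges.add((x, mid))                 # x -> x.y
        edges.add((mid, y))                 # x.y -> y
        edges.add((mid, ("label", b)))      # node x.y has label b
    return edges


if __name__ == "__main__":
    for edge in sorted(simulate_labels({1: "plant"}, [(1, "eaten_by", 2)]), key=repr):
        print(edge)
```

After this encoding every label is an ordinary node of the data graph, so a pattern can refer to it through a parameter.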

A simple example of a tree query is shown in Figure 1(a); when applied
to a food web (a data graph of organisms, where there is an edge x → y if
y feeds on x), it describes all organisms x that compete with organism #8 for
some organism as food, that itself feeds on organism #0. This pattern has one
existential node, two parameters, and one distinguished node x. Figure 1(b)
shows another example of a tree query; when applied to a food web, it describes
all organisms x that have a path of length four beneath them that ends in
organism #8.

Effectively, tree queries are what is known in database research as conjunctive
queries [9, 40, 2]; these are the queries we could pose to the data graph (stored
as a two-column table) in the core fragment of SQL where we do not use aggregates
or subqueries, and use only conjunctions of equality comparisons as
where-conditions. For example, the pattern of Figure 1(a) amounts to the following
SQL query on a table G(from, to):

select distinct G3.to as x
from G G1, G G2, G G3
where G1.from=0 and G1.to=G2.from
and G2.to=8 and G3.from=G2.from
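To try the query above on a small example, the following sketch loads a toy edge table into an in-memory SQLite database through Python's standard sqlite3 module. The function name `competitors` and the sample food web are our own assumptions. Because from and to are reserved words in SQL, the column names are quoted here.

```python
import sqlite3


def competitors(edges):
    """Run the conjunctive query of Figure 1(a) on a list of (from, to) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE G ("from" INTEGER, "to" INTEGER)')
    conn.executemany("INSERT INTO G VALUES (?, ?)", edges)
    rows = conn.execute(
        'SELECT DISTINCT G3."to" AS x '
        'FROM G G1, G G2, G G3 '
        'WHERE G1."from" = 0 AND G1."to" = G2."from" '
        'AND G2."to" = 8 AND G3."from" = G2."from"'
    ).fetchall()
    conn.close()
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    # Hypothetical food web: organism 4 feeds on 0; organisms 8 and 1 feed on 4.
    food_web = [(0, 4), (4, 8), (4, 1), (0, 5), (5, 2)]
    print(competitors(food_web))  # prints [1, 8]
```

Note that organism 8 itself appears in the answer: matchings are homomorphisms, so x is allowed to coincide with the parameter node.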

In the present work we also introduce association rules over tree queries. By
mining for tree-query associations we can discover quite subtle properties of the
data graph. Figure 2(a) shows a very simple example of an association that our
algorithm might find in a social network: a data graph of persons where there
is an edge x → y if x considers y to be a close friend. The tree query on the left
matches all pairs (x1, x2) of "co-friends": persons that are friends of a common
person (represented by an existential variable). The query on the right matches
all co-friends x1 of person #5 (represented by a parameterized node), and pairs
all those co-friends to person #5. Now were the association from the left to the
right to be discovered with a confidence of c, with 0 < c < 1, then this would
mean that the pairs retrieved by the right query actually constitute a fraction
of c of all pairs retrieved by the left query, which indicates (for nonnegligible c)
that 5 plays a special role in the network.¹

Figure 2: Simple examples of association rules over tree queries.



Figure 2(b) shows quite a different, but again simple, example of a tree-query 
association that our algorithm might discover in a food web. With confidence 
c, this association means that of all organisms that are not on top of the food 
chain (i.e., they are fed upon by some other organism), a fraction of c is actually 
at least two down in the food chain. 

The examples of tree queries and associations we just saw are didactical
examples, but in Section 7 we will see more complicated examples of tree queries
and associations mined in real-life datasets.

In this paper we present algorithms for mining tree queries and association
rules over tree queries in a large data graph. Some important features of these
algorithms are the following:

1. Our algorithms belong to the group of graph mining algorithms where the
input is a single large graph, and the task is to discover patterns that occur
sufficiently often in the single data graph. We will refer to this group of
algorithms as the single-graph category. There is also a second category
of graph mining algorithms, called the transactional category, which is
explained in Section 2.

2. We restrict to patterns that are trees, such as the example in Figure 1.
Tree patterns have formed an important special case in the transactional
category (Section 2), but have not yet received special attention in the
single-graph literature. Note that the data graph that is being mined is
not restricted in any way.

3. The tree-query-mining algorithm is incremental in the number of nodes of 
the pattern. So, our algorithm systematically considers ever larger trees, 
and can be stopped any time it has run long enough or has produced 
enough results. Our algorithm does not need any space beyond what is 
needed to store the mining results. Thanks to the restriction to tree shapes 
the duplicate-free generation of trees can be done efficiently. 



1 Note that this does not just mean that 5 has many co-friends; if we only wanted to express 
that, just a frequent pattern in the form of the right query would suffice. For instance, imagine 
a data graph consisting of n disjoint 2-cliques (pairs of persons who have each other as a 
friend), where additionally all these persons also consider 5 to be an extra friend (but not 
vice versa). In such a data graph, 5 is a co-friend of everybody, and the association has a
rather high confidence of more than 2/7. If, however, we would now add to the data graph a 
separate n-clique, then still 2/3rds of all persons are a co-friend of 5, which is still a lot, but 
the confidence drops to below 2/n. 
4. For each tree, all conjunctive queries based on that tree are generated in
the tree-query-mining algorithm. Here, we work in a levelwise fashion in 
the sense of Mannila and Toivonen [31] . 

5. As in classical association rules over itemsets [5], our association rule gen- 
eration phase comes after the generation of frequent patterns and does 
not require access to the original dataset. 

6. We apply the theory of conjunctive database queries [9, 40, 2] to formally
define and to correctly generate association rules over tree queries. The
conjunctive-query approach to pattern matching allows for an efficiently
checkable notion of frequency, whereas in the subgraph-based approach,
determining whether a pattern is frequent is NP-complete (in that approach
the frequency of a pattern is the maximal number of disjoint subgraphs
isomorphic to the pattern [29]).

7. There is a notion of equivalence among tree queries and association rules 
over tree queries. We carefully and efficiently avoid the generation of 
equivalent tree queries and associations, by using and adapting what is 
known from the theory of conjunctive database queries. Due to the re- 
striction to tree shapes, equivalence and redundancy (which are normally 
NP-complete) are efficiently checkable. 

8. Last but not least, our algorithms naturally suggest a database-oriented
implementation in SQL. This is useful for several reasons. First, the
number of discovered patterns can be quite large, and it is important to
keep them available in a persistent and structured manner, so that they
can be browsed easily, and so that association rules can be derived
efficiently. Second, we will show how the use of SQL allows us to generate
and check large numbers of similar patterns in parallel, taking advantage
of the query processing optimizations provided by modern relational
database systems. Third, a database-oriented implementation does not
require us to move the dataset out of the database before it can be mined. In
classical itemset mining, database-oriented implementations have received
serious attention [39, 36], but less so in graph mining, a recent exception
being an implementation in SQL of the seminal SUBDUE algorithm [5].

The purpose of this paper is to introduce tree queries and tree-query associ- 
ations and to present algorithms for mining tree queries and tree-query associa- 
tions. Concrete applications to discover new knowledge about scientific datasets 
are the topic of current research. Yet, the algorithms are fully implemented and 
we can already show that our approach works in practice, by showing some 
concrete results mined from a food web, a protein interactions graph, and a 
citation graph. We will also give performance results on random data graphs 
(as a worst-case scenario). 

2 Related Work 

Approaches to graph mining, especially mining for frequent patterns or
association rules, can be divided into two major categories which are not to be confused.
1. In transactional graph mining, e.g., [H |2D m 123 119 SD 52], the dataset
consists of many small data graphs which we call transactions, and the 
task is to discover patterns that occur at least once in a sufficient number 
of transactions. (Approaches from machine learning or inductive logic 
programming usually call the small data graphs "examples" instead of 
transactions.) 

2. In single-graph mining the dataset is a single large data graph, and the 
task is to discover patterns that occur sufficiently often in the dataset. 

Note that single-graph mining is more difficult than transactional mining, 
in the sense that transactional graph mining can be simulated by single-graph 
mining, but the converse is not obvious. 

Since our approach falls squarely within the single-graph category, we will
focus on that category in this section. Most work in this category has been done
on frequent pattern mining, and less attention has been spent on association
rules. We briefly review the work in this category next:

• Cook and Holder [11] apply in their SUBDUE system the minimum description
length (MDL) principle to discover substructures in a labeled
data graph. The MDL principle states that the best pattern is the one
that minimizes the description length of the complete data graph.
Hence, in SUBDUE a pattern is evaluated on how well it can compress
the entire dataset. The input for the SUBDUE system is a labeled data
graph; nodes and edges are labeled with non-unique labels. This is in contrast
with the unique labels ('constants') in our system. As we already
noted, non-unique node labels and edge labels can easily be simulated by
constants, but the converse is not obvious. The SUBDUE system only
mines patterns, not association rules.

• Ghazizadeh and Chawathe [16] mine in their SEuS system for connected 
subgraphs in a labeled, directed data graph, as in the SUBDUE system. 
Instead of generating candidate patterns using the input data graph, SEuS 
uses a summary of the data graph. This summary gives an upper bound 
for the support of the patterns, and the user can then select those patterns 
of which he wants to know the exact support. SEuS also only mines for 
frequent patterns and not for associations. 

• Vanetik, Gudes, and Shimony [19] propose an Apriori-like 3 algorithm 
for mining subgraphs from a labeled data graph. The support of a graph 
pattern is defined as the maximal number of edge-disjoint instances of the 
pattern in the data graph. By reducing the support counting problem to 
the maximal independent set problem on graphs, they show that in worst 
case, computing the support of a graph pattern is NP-hard. They propose 
an Apriori-like algorithm to minimize the number of patterns for which 
the support needs to be computed. The major idea of their approach is 
using edge-disjoint paths as building blocks instead of items in classical 
itemset mining. Vanetik, Gudes, and Shimony also only mine for frequent 
patterns in the data graph. 

• Kuramochi and Karypis [55] use the same support measure for graph patterns
as Vanetik, Gudes, and Shimony [19]. They also note that computing
the support of a graph pattern is NP-hard in the worst case, since it can be
reduced to finding the maximum independent set (MIS) in a graph.
Kuramochi and Karypis quickly compute the support of a graph pattern
using approximate MIS algorithms. The number of candidate patterns is
restricted using canonical labeling. Like the majority of approaches,
Kuramochi and Karypis only mine for frequent patterns.

• Jeh and Widom [24] consider patterns that are, like our tree queries,
inspired by conjunctive database queries, and they also emphasize the tree-
shaped case. A severe restriction, however, is that their patterns can be 
matched by single nodes only, rather than by tuples of nodes. Still their 
work is interesting in that it presents a rather nonstandard approach to 
graph mining, quite different from the standard incremental, levelwise ap- 
proach, and in that it incorporates ranking. Jeh and Widom mention 
association rules as an example of an application of their mining frame- 
work. 

The related work that was most influential for us is Warmr [13], although
it belongs to the transactional category. Based on inductive logic programming,
patterns in Warmr also feature existential variables and parameters. While
not restricted to tree shapes, the queries in Warmr are restricted in another
sense so that only transactional mining can be supported. Association rules in
Warmr are defined in a naive manner through pattern extension, rather than
being founded upon the theory of conjunctive query containment. The Warmr
system is also Prolog-oriented rather than database-oriented; we believe a
database-oriented approach is fundamental to mining of single large data graphs,
and it allows a more uniform and parallel treatment of parameter instantiations,
as we will show in this paper. Finally, Warmr does not seriously attempt to avoid
the generation of duplicates. Yet, Warmr remains a pathbreaking work, which did
not receive sufficient follow-up in the data mining community at large. We hope
our present work represents an improvement in this respect. Many of the
improvements we make to Warmr were already envisaged (but without concrete
algorithms) in 2002 by Goethals and the second author [18].

Finally, we note that parameterized conjunctive database queries have been 
used in data mining quite early, e.g., [331 I3E], but then in the setting of "data 
mining query languages" , where a single such query serves to specify a family 
of patterns to be mined or queried for, rather than the mining for such queries 
themselves, let alone associations among them. 

3 Problem Statement 

In this section we define some concepts formally. In the appendix an overview 
of all notations used in this paper is given. 

We basically assume a set U of data constants from which the nodes of the 
data graph to be mined will be taken. 

3.1 Graph-theoretic concepts 

Let N ⊆ U be any finite set of nodes; nodes can be any data objects such as
numbers or strings. For our purposes, we define a (directed) graph on N as
a subset of N², i.e., as a finite set of ordered pairs of nodes. These pairs are
called edges. We assume familiarity with the notion of a tree as a special kind
of graph, and with standard graph-theoretic concepts such as root of a tree;
children, descendants, parent, and ancestors of a node; and path in a graph.
Any good algorithms textbook will supply the necessary background.

Figure 3: (a) is a parameterized tree pattern, and (b) is an instantiation of (a).

In this paper all trees we consider are rooted and unordered, unless stated 
otherwise. 

3.2 Tree Pattern 

Tree Patterns A parameterized tree pattern P is a tree whose nodes are called 
variables, and where additionally: 

• Some variables may be marked as being existential; 

• Some other variables may be marked as parameters; 

• The variables of P that are neither existential nor parameters are called 
distinguished. 

We will denote the set of existential variables by Π, the set of parameters by
Σ, and the set of distinguished variables by Δ. To make clear that these sets
belong to some parameterized tree pattern P we will use a subscript as in Π_P
or Σ_P.

A parameter assignment α, for a parameterized tree pattern P, is a mapping
Σ → U which assigns data constants to the parameters.

An instantiated tree pattern is a pair (P, α), with P a parameterized tree
pattern and α a parameter assignment for P. We will also denote this by P^α.

When depicting parameterized tree patterns, existential nodes are indicated
by labeling them with the symbol '∃' and parameters are indicated by labeling
them with the symbol 'σ'. When depicting instantiated tree patterns,
parameters are indicated by directly writing down their parameter assignment.

Figure 3 shows an illustration.

Matching Recall that a homomorphism from a graph G1 to a graph G2 is a
mapping μ from the nodes of G1 to the nodes of G2 that preserves edges, i.e., if
(i, j) ∈ G1 then (μ(i), μ(j)) ∈ G2. We now define a matching of an instantiated
tree pattern P^α in a data graph G as a homomorphism μ from the underlying
tree of P to G, with the constraint that for any parameter σ, if α(σ) = a, then
μ(σ) must be the node a. We denote the set {μ|_Δ : μ is a matching of P^α in G}
by P^α(G), where μ|_Δ is the restriction of μ to the distinguished variables.

Figure 4: Two data graphs.



Frequency of a tree pattern The frequency of an instantiated tree pattern
P^α in a data graph G is formally defined as the cardinality of P^α(G). So,
we count the number of matchings of P^α in G, with the important provision
that we identify any two matchings that agree on the distinguished variables.
Indeed, two matchings that differ only on the existential nodes need not be
distinguished, as this is precisely the intended semantics of existential nodes.
Note that we do not need to worry about parameters, as all matchings will
agree on those by definition. For a given threshold k (a natural number) we
say that P^α is k-frequent if its frequency is at least k. Often the threshold is
understood implicitly, and then we talk simply about "frequent" patterns and
denote the threshold by minsup.
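The definitions of matching and frequency can be made concrete with a small brute-force sketch. It is meant only as an executable reading of the definitions, not as the mining algorithm of this paper; the function names, the representation of a pattern as a set of (parent, child) pairs, and the example graph are our own assumptions.

```python
def matchings(tree_edges, assignment, graph):
    """Yield every matching (as a dict) of an instantiated tree pattern.

    tree_edges: set of (parent, child) pairs over pattern variables
    assignment: dict mapping each parameter to its data constant
    graph:      set of (from, to) pairs, the data graph
    """
    variables = sorted({v for e in tree_edges for v in e})
    nodes = {n for e in graph for n in e}

    def extend(i, mu):
        if i == len(variables):
            yield dict(mu)
            return
        v = variables[i]
        # A parameter may only be mapped to its assigned constant.
        candidates = [assignment[v]] if v in assignment else nodes
        for n in candidates:
            mu[v] = n
            # Check every pattern edge whose two endpoints are already mapped.
            if all((mu[a], mu[b]) in graph
                   for a, b in tree_edges if a in mu and b in mu):
                yield from extend(i + 1, mu)
            del mu[v]

    yield from extend(0, {})


def frequency(tree_edges, assignment, distinguished, graph):
    """Count matchings, identifying those that agree on the distinguished variables."""
    return len({tuple(mu[x] for x in distinguished)
                for mu in matchings(tree_edges, assignment, graph)})


if __name__ == "__main__":
    # Pattern shaped like Figure 1(a): z1 -> y, y -> z2, y -> x, with y existential.
    pattern = {("z1", "y"), ("y", "z2"), ("y", "x")}
    alpha = {"z1": 0, "z2": 8}
    g = {(0, 4), (4, 8), (4, 1), (0, 5), (5, 8), (5, 1)}
    print(len(list(matchings(pattern, alpha, g))))   # 4 matchings
    print(frequency(pattern, alpha, ["x"], g))       # frequency 2
```

In this hypothetical graph there are four matchings (y can be 4 or 5, and x can be 1 or 8), but only two distinct values of x, so the frequency is 2.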



Example. Take again the instantiated tree pattern P^α shown in Figure 3(b). Let
us name the existential node by y; let us name the parameter labeled 0 by z1; the
parameter labeled 8 by z2; and the parameter labeled 6 by z3. The distinguished
node already has the name x. Now let us apply P^α to the simple example data
graph G shown in Figure 4(a). The following table lists all matchings of P^α in
G:

      z1   x    y    z2   z3
h1    0    ?    4    8    6
h2    0    ?    4    8    6
h3    0    ?    5    8    6

As required by the definition, all matchings match z1 to 0, z2 to 8, and z3 to 6.
Although there are three matchings, when determining the frequency of P^α in
G, we only look at their value on x to distinguish them, as y is existential. So,
h2 and h3 are identified as identical matchings when counting the number of
matchings. In conclusion, the frequency of P^α in G is two, as x can be matched
to the two different nodes 1 and 2. □



3.3 Tree Query 

Tree Queries A parameterized tree query Q is a pair (H, P) where:

1. P is a parameterized tree pattern, called the body of Q;

2. H is a tuple of distinguished variables and parameters coming from P.
All distinguished variables of P must appear at least once in H. We call
H the head of Q.

A parameter assignment for Q is simply a parameter assignment for its body,
and an instantiated tree query is then again a pair (Q, α) with Q a parameterized
tree query and α a parameter assignment for Q. We will again also denote this
by Q^α.

When depicting tree queries, the head is given above a horizontal line, and
the body below it. Two illustrations are given in Figure 5.

Figure 5: (a) and (b) are parameterized tree queries; (c) is an instantiation of
(a); (d) is an instantiation of (b); and query (b) is ρ-contained in query (a).

Frequency of a tree query The frequency of an instantiated tree query
Q^α = ((H, P), α) in a data graph G is defined as the frequency of the body P^α
in G. When G is understood, we denote the frequency by Freq(P^α). For a given
threshold k (a natural number) we say that Q^α is k-frequent if its frequency is
at least k. Again, this threshold is often understood implicitly, and then we talk
simply about "frequent" queries and denote the threshold by minsup.

Containment of tree queries An important step towards our formal
definition of tree-query association is the notion of containment among queries.
Since queries are parameterized, a variation of the classical notion of containment
[9, 40, 2] is needed in that we now need to specify a parameter correspondence.

First, we define the answer set of an instantiated tree query Q^α, with Q =
(H, P), in a data graph G as follows:

Q^α(G) := {μ(H) | μ is a matching of P^α in G}

Consider two parameterized tree queries Q1 and Q2, with Qi = (Hi, Pi)
for i = 1, 2. A parameter correspondence from Q1 to Q2 is any mapping ρ :
Σ1 → Σ2. We then say that a parameterized tree query Q2 is ρ-contained in
a parameterized tree query Q1, if for every α2, a parameter assignment for Q2,
Q2^{α2}(G) ⊆ Q1^{α2∘ρ}(G) for all data graphs G. In shorthand notation we write
this as Q2 ⊆_ρ Q1.

Containment as just defined is a semantical property, referring to all possible
data graphs, and it is not immediately clear how one could decide this property
syntactically. The required syntactical notion for this is that of ρ-containment
mapping, which we next define in two steps. Let Q1 and Q2 be tree queries as
above, and let ρ be a parameter correspondence from Q1 to Q2.

1. A ρ-containment mapping from P1 to P2 is a homomorphism f from the
underlying tree of P1 to the underlying tree of P2, with the properties:

(a) f maps the distinguished nodes of P1 to distinguished nodes or
parameters of P2; and

(b) f|_{Σ1} = ρ, i.e., for each z ∈ Σ1 we have f(z) = ρ(z).

2. Finally, a ρ-containment mapping from Q1 to Q2 is a ρ-containment
mapping f from P1 to P2 such that f(H1) = H2.
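The two-step definition above translates directly into a brute-force search. The sketch below is only an illustration of the definition under our own assumptions: a query is represented as a dict with keys "head", "edges", "exist", and "params" (names of our choosing), and all candidate mappings are enumerated, which takes exponential time and does not exploit the tree shape.

```python
from itertools import product


def nodes_of(q):
    """All variables occurring in the body of query q."""
    return sorted({v for e in q["edges"] for v in e})


def containment_mapping(q1, q2, rho):
    """Return a rho-containment mapping from q1 to q2 as a dict, or None.

    rho maps each parameter of q1 to a parameter of q2.
    """
    n1, n2 = nodes_of(q1), nodes_of(q2)
    free1 = [v for v in n1 if v not in q1["params"]]
    dist1 = set(n1) - q1["exist"] - q1["params"]
    allowed2 = set(n2) - q2["exist"]      # distinguished nodes or parameters
    for image in product(n2, repeat=len(free1)):
        f = dict(zip(free1, image))
        f.update(rho)                     # property (b): f restricted to params is rho
        if (all((f[a], f[b]) in q2["edges"] for a, b in q1["edges"])   # homomorphism
                and all(f[x] in allowed2 for x in dist1)               # property (a)
                and tuple(f[v] for v in q1["head"]) == tuple(q2["head"])):
            return f
    return None


if __name__ == "__main__":
    # Hypothetical queries: q1 asks for x with an edge x -> s (s a parameter);
    # q2 asks for x with x -> t (t a parameter) and t -> y (y existential).
    q1 = {"head": ("x",), "edges": {("x", "s")}, "exist": set(), "params": {"s"}}
    q2 = {"head": ("x",), "edges": {("x", "t"), ("t", "y")},
          "exist": {"y"}, "params": {"t"}}
    print(containment_mapping(q1, q2, {"s": "t"}))
```

For these two hypothetical queries the search finds the mapping that sends x to x and s to t, so q2 is ρ-contained in q1 for ρ = {s → t}.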

For later use, we note: 

Lemma 1. Consider three parameterized tree patterns P1, P2, and P3, a
parameter correspondence ρ1 : Σ1 → Σ2, a parameter correspondence ρ2 : Σ2 → Σ3, a
ρ1-containment mapping f1 from P1 to P2, and a ρ2-containment mapping f2
from P2 to P3. Then f2 ∘ f1 is a (ρ2 ∘ ρ1)-containment mapping from P1 to P3.

Proof. We will show that:

1. f2 ∘ f1 is a homomorphism;

2. f2 ∘ f1 maps distinguished nodes of P1 to distinguished nodes or parameters
of P3; and

3. (f2 ∘ f1)|_{Σ1} = ρ2 ∘ ρ1.

(1) Clearly f2 ∘ f1 is a homomorphism since both f1 and f2 are
homomorphisms, and a composition of homomorphisms is a homomorphism.

(2) Consider an x1 ∈ Δ1; then there are two possibilities for f1(x1):

1. f1(x1) = x2, with x2 ∈ Δ2. Then we know, since f2 is a ρ2-containment
mapping, that f2(x2) is either a distinguished node x3 ∈ Δ3, or a
parameter z3 ∈ Σ3.

2. f1(x1) = z2, with z2 ∈ Σ2. Then we know, since f2|_{Σ2} = ρ2, that f2(z2) =
z3, with z3 ∈ Σ3.

Hence, we can conclude that f2 ∘ f1 maps distinguished nodes of P1 to
distinguished nodes or parameters of P3.

(3) For each z1 ∈ Σ1, we have f2(f1(z1)) = ρ2(ρ1(z1)). Hence,
(f2 ∘ f1)|_{Σ1} = ρ2 ∘ ρ1. □

From the theory of conjunctive database queries [9, 40, 2] we can derive the
following:

Lemma 2. Consider two parameterized tree queries Q1 and Q2, with Q1 =
(H1, P1) and Q2 = (H2, P2), and a parameter correspondence ρ : Σ1 → Σ2. Then
Q2 is ρ-contained in Q1 (Q2 ⊆_ρ Q1), if and only if there exists a ρ-containment
mapping from Q1 to Q2.
Proof. Let us start with the 'only if' direction. We first introduce the concept
of a freezing of a parameterized tree query Q = (H, P). Recall that U is the set
of data constants from which the nodes of the data graph to be mined will be
taken. A freezing β of P is then a one-to-one mapping from the nodes of P to
U. We denote by freeze_β(P) the data graph constructed from P by replacing
each node n of P by β(n), and we denote by freeze_β(H) the tuple constructed
from H by replacing each node n in H by the data constant β(n).

For example, consider the parameterized tree query Q = (H, P) in
Figure 6(a). Figure 6(b) shows freeze_β(H) and freeze_β(P) for the freezing β given
as follows: x1 → c1; x2 → c2; x3 → c3; x4 → c4; x5 → c5; σ6 → c6.

We can now continue with the proof of the 'only if' direction. Consider
a freezing β from the nodes of P2 to U. Note that β|_{Σ2} is a parameter
assignment for Q2, and freeze_β(H2) ∈ Q2^{β|Σ2}(freeze_β(P2)). Since Q2 ⊆_ρ Q1, also
freeze_β(H2) ∈ Q1^{β|Σ2 ∘ ρ}(freeze_β(P2)). Hence, there must be a matching μ of
P1^{β|Σ2 ∘ ρ} in freeze_β(P2) such that μ(H1) = freeze_β(H2). Now consider the
function g = β⁻¹ ∘ μ. We show that g is a ρ-containment mapping from Q1 to Q2:

1. Clearly, g is a homomorphism from P1 to P2 since μ is a homomorphism
and β is an isomorphism. Also the following properties hold for g:

(a) g maps distinguished nodes of P1 to distinguished nodes or
parameters of P2 since g(H1) = H2 (as shown in (2)); and

(b) for each z ∈ Σ1: g(z) = β⁻¹(μ(z)) = β⁻¹(β(ρ(z))) = ρ(z), hence
g|_{Σ1} = ρ.

2. g(H1) = β⁻¹(μ(H1)) = β⁻¹(freeze_β(H2)) = H2.

Hence, we conclude that g is a ρ-containment mapping from Q1 to Q2.

Let us then look at the 'if' direction. Let h be a ρ-containment mapping
from Q1 to Q2. Consider an arbitrary parameter assignment α2 for Q2. We
must prove that for every data graph G, if a ∈ Q2^{α2}(G), then also a ∈ Q1^{α2∘ρ}(G).
Consider such an arbitrary data graph G. Since a ∈ Q2^{α2}(G), we know that
there exists a matching μ of P2^{α2} in G such that a = μ(H2). Now consider
the function g = μ ∘ h. We show that g is a matching of P1^{α2∘ρ} in G and
a = g(H1):

1. g is a homomorphism since both μ and h are homomorphisms; and

2. for each z ∈ Σ1 we have g(z) = μ(h(z)) = μ(ρ(z)) = α2(ρ(z)).

So, g is indeed a matching of P1^{α2∘ρ} in G. Finally, we observe that g(H1) =
μ(h(H1)) = μ(H2) = a, as desired. □

Checking for a containment mapping is evidently computable, and although 
the problem for general database conjunctive queries is NP-complete, our re- 
striction to tree shapes allows for efficient checking, as we will see later. 
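The proof of Lemma 2 also suggests a simple, if naive, decision procedure: freeze the body of Q2 into a data graph and check whether the frozen head of Q2 is an answer of Q1 under the assignment β|Σ2 ∘ ρ. The sketch below follows that idea under our own assumptions: the dict representation of queries and all function names are hypothetical, the freezing simply tags each variable as ("c", variable), and answers are enumerated by brute force rather than with the efficient tree-based check announced above.

```python
from itertools import product


def answers(head, edges, assignment, graph):
    """Answer set {mu(head)} over all matchings mu of the instantiated body."""
    variables = sorted({v for e in edges for v in e} - set(assignment))
    nodes = sorted({n for e in graph for n in e})
    result = set()
    for image in product(nodes, repeat=len(variables)):
        mu = dict(zip(variables, image))
        mu.update(assignment)             # parameters go to their constants
        if all((mu[a], mu[b]) in graph for a, b in edges):
            result.add(tuple(mu[v] for v in head))
    return result


def contained_by_freezing(q1, q2, rho):
    """Test whether q2 is rho-contained in q1, using only the graph freeze(P2)."""
    def beta(n):
        return ("c", n)                   # a one-to-one freezing of the nodes of P2
    frozen_body = {(beta(a), beta(b)) for a, b in q2["edges"]}
    frozen_head = tuple(beta(v) for v in q2["head"])
    assignment = {s: beta(rho[s]) for s in q1["params"]}   # (beta on params) after rho
    return frozen_head in answers(q1["head"], q1["edges"], assignment, frozen_body)


if __name__ == "__main__":
    q1 = {"head": ("x",), "edges": {("x", "s")}, "exist": set(), "params": {"s"}}
    q2 = {"head": ("x",), "edges": {("x", "t"), ("t", "y")},
          "exist": {"y"}, "params": {"t"}}
    print(contained_by_freezing(q1, q2, {"s": "t"}))
```

The test relies on the requirement that all distinguished variables of a query appear in its head, exactly as in the proof.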

Example. Consider the parameterized and instantiated tree queries shown in
Figure 5. In the example data graph in Figure 4(a) the frequency of query (c)
is 10 and that of query (d) is 2. Let Σa be the set of parameters of query (a),
and let Σb be the set of parameters of query (b); then let the parameter
correspondence ρ : Σa → Σb be as follows: σ1 → σ1; σ2 → σ2. A moment's
reflection should convince the reader that (b) is ρ-contained in (a), and indeed
a ρ-containment mapping f from (a) to (b) can be found as follows:

Figure 6: (b) is a freezing of the parameterized tree query in (a)
3.4 Tree-Query Association

Association Rules A parameterized association rule (pAR) is of the form
Q1 ⇒_ρ Q2, with Q1 and Q2 parameterized tree queries and ρ a parameter
correspondence from Σ1 to Σ2. We call a pAR legal if Q2 ⊆_ρ Q1. We call
Q1 the left-hand side (lhs), and Q2 the right-hand side (rhs). A parameter
assignment α, for a pAR, is a mapping Σ2 → U which assigns data constants to
the parameters. An instantiated association rule (iAR) is a pair (Q1 ⇒_ρ Q2, α),
with Q1 ⇒_ρ Q2 a pAR and α a parameter assignment for Q1 ⇒_ρ Q2. Note that
while α is only defined on the rhs, we can also apply it to the lhs by using ρ
first.

Confidence The confidence of an iAR in a data graph G is defined as the
frequency of Q2^α divided by the frequency of Q1^{α∘ρ}. If the iAR is legal, we know
that the answer set of Q2^α is a subset of the answer set of Q1^{α∘ρ}, and hence the
confidence equals precisely the proportion that the Q2^α answer set takes up in
the Q1^{α∘ρ} answer set. Thus, our notions of a legal pAR and confidence are very
intuitive and natural.

For a given threshold c (a rational number, 0 < c < 1) we say that the
iAR is c-confident in G if its confidence in G is at least c. Often the threshold
is understood implicitly, and then we talk simply about "confident" iARs and
denote the threshold by minconf.

Furthermore, the iAR is called frequent in G if Q2^α is frequent in G. Note
that if the iAR is legal and frequent, then also Q1^{α∘ρ} is frequent, since the rhs
is ρ-contained in the lhs.
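As a worked illustration of the confidence definition, the following sketch evaluates both sides of an iAR by brute force and divides the sizes of their answer sets. The query representation, the function names, and the small social network (two disjoint pairs of mutual friends who all also consider person 5 a friend) are our own assumptions; the two queries are modeled on the co-friends example of Figure 2(a).

```python
from fractions import Fraction
from itertools import product


def answers(head, edges, assignment, graph):
    """Answer set {mu(head)} over all matchings mu of the instantiated body."""
    variables = sorted({v for e in edges for v in e} - set(assignment))
    nodes = sorted({n for e in graph for n in e})
    result = set()
    for image in product(nodes, repeat=len(variables)):
        mu = dict(zip(variables, image))
        mu.update(assignment)
        if all((mu[a], mu[b]) in graph for a, b in edges):
            result.add(tuple(mu[v] for v in head))
    return result


def confidence(lhs, rhs, rho, alpha, graph):
    """Confidence of the iAR (lhs =>_rho rhs, alpha); alpha assigns the rhs parameters.

    Assumes the lhs has at least one answer in the graph.
    """
    lhs_alpha = {s: alpha[rho[s]] for s in lhs["params"]}   # alpha applied after rho
    n_rhs = len(answers(rhs["head"], rhs["edges"], alpha, graph))
    n_lhs = len(answers(lhs["head"], lhs["edges"], lhs_alpha, graph))
    return Fraction(n_rhs, n_lhs)


if __name__ == "__main__":
    # lhs: pairs (x1, x2) of co-friends; rhs: co-friends x1 of the parameter s.
    lhs = {"head": ("x1", "x2"), "edges": {("z", "x1"), ("z", "x2")}, "params": set()}
    rhs = {"head": ("x1", "s"), "edges": {("z", "x1"), ("z", "s")}, "params": {"s"}}
    g = {(1, 2), (2, 1), (3, 4), (4, 3), (1, 5), (2, 5), (3, 5), (4, 5)}
    print(confidence(lhs, rhs, {}, {"s": 5}, g))   # prints 5/13
```

Here the left query has 13 answers and the right query has 5, so the confidence is 5/13. Because every distinguished variable appears in the head, the size of an answer set equals the frequency of the query.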
Example. Continuing the previous example, we can see that we can form a legal
pAR from the queries of Figure 5 with (a) the lhs and (b) the rhs and ρ as
follows: σ1 → σ1; σ2 → σ2. We can also form an iAR with the tree queries
in Figure 5(c) and Figure 5(d); the confidence of this iAR in the data graph of



Figure 4(a) is 2/10. Many more examples of ARs are given in Section \b[ 



3.5 Mining Problems 

We are now finally ready to define the graph mining problems we want to solve. 

3.5.1 Mining Tree Queries 

Input: A data graph G; a threshold minsup. 

Output: All frequent instantiated tree queries Q^α = ((H, P), α).

In theory, however, there are infinitely many k-frequent tree queries, and
even if we set an upper bound on the size of the patterns, there may be
exponentially many. As an extreme example, if G is the complete graph on the set of
nodes {1, ..., n}, and k < n, then any instantiated pattern with all parameters
assigned to values in {1, ..., n}, and with at least one distinguished variable, is
frequent.

Hence, in practice, we want an algorithm that runs incrementally, and that
can be stopped any time it has run long enough or has produced enough results.
We introduce such an algorithm in Section 4.

3.5.2 Association Rule Mining 

Input: A data graph $G$; a threshold minsup; a parameterized tree query $Q_{left}$;
and a threshold minconf.

Output: All iARs $(Q_{left} \Rightarrow_\rho Q_{right}, \alpha)$ that are legal, frequent and confident
in $G$.



In theory, however, there are infinitely many legal, frequent and confident 
association rules for a fixed lhs, and even if we set an upper bound on the size 
of the rhs, there may be exponentially many. Hence, in practice, we want an 
algorithm that runs incrementally, and that can be stopped any time it has run 
long enough or has produced enough results. We introduce such an algorithm
in Section 5.



4 Mining Tree Queries 

In this Section we present an algorithm for mining frequent instantiated tree 
queries in a large data graph. But first we show that we do not need to tackle 
the problem in its full generality. 

4.1 Problem Reduction 

In this subsection we show that, without loss of generality, we can focus on 
parameterized tree queries that are 'pure'. 
Figure 7: (a) is an impure parameterized tree query. The parameterized tree
query in (b) is the purification of the parameterized tree query (a), and expresses
precisely the same information.

Figure 8: (b) is the pure instantiated tree query constructed from the
instantiated tree pattern in (a).



Pure Tree Queries To define this formally, assume that all possible variables
(nodes of tree patterns) have been arranged in some fixed but arbitrary order.
We then call a parameterized tree query $Q = (H, P)$ pure when $H$ consists
of the enumeration, in order and without repetitions, of all the distinguished
variables of $P$. In particular $H$ cannot contain parameters. We call $H$ the pure
head for $P$. As an illustration, the parameterized tree query in Figure 5(a) is
pure, while the parameterized tree query in Figure 5(b) is not pure.

A parameterized tree query that is not pure can always be rewritten to a 
parameterized tree query that is pure, in such a way that all instantiations of 
the impure query correspond to instantiations of the pure query, with the same 
frequency. Indeed, take a parameterized tree query $Q = (H, P)$. We can purify
$Q$ by removing all parameters and repetitions of distinguished variables from
$H$, and sorting $H$ by the order on the variables. An illustration of this is
given in Figure 7.

We can conclude that it is sufficient to only consider pure instantiated tree
queries. As a consequence, rather than mining tree queries, it suffices to mine
for tree patterns, because the frequency of a query is nothing other than the
frequency of its body, i.e., a pattern. An illustration is given in Figure 8.
4.2 Overall Approach

An overall outline of our tree-query mining algorithm is the following: 

Outer loop: Generate, incrementally, all possible trees T of increasing sizes. 
Avoid trees that are isomorphic to previously generated ones. 

Inner loop: For each $T$, generate all instantiated tree patterns $P^\alpha$ based on
$T$, and test their frequency.

The algorithm is incremental in the number of nodes of the pattern. We 
generate canonically ordered rooted trees of increasing sizes, avoiding the gen- 
eration of isomorphic duplicates. It is well known how to do this efficiently 
[571 1501 021 [TUJ . Note that this generation of trees is in no way "levelwise" [5T] . 
Indeed, under the way we count pattern occurrences, a subgraph of a pattern
might be less frequent than the pattern itself (this was already pointed out by
Kuramochi and Karypis [29]). So, our algorithm systematically considers ever
larger trees, and can be stopped any time it has run long enough or has
produced enough results. Our algorithm does not need any space beyond what
is needed to store the mining results. The outer loop of our algorithm will be
explained in more detail in Section 4.3.

For each tree, all conjunctive queries based on that tree are generated. Here, 
we do work in a levelwise fashion. This aspect of our algorithm has clear sim- 
ilarities with "query flocks" [53]. A query flock is a user-specified conjunctive 
query, in which some constants are left unspecified and viewed as parameters. 
A levelwise algorithm was proposed for mining all instantiations of the param- 
eters under which the resulting query returns enough answers. We push that 
approach further by also mining the query flocks themselves. Consequently, 
the specialization relation on queries used to guide the levelwise search is quite 
different in our approach. The inner loop of our algorithm will be explained in
more detail in Section 4.4.

A query based on some tree may be equivalent to a query based on a pre- 
viously seen tree. Furthermore, two queries based on the same tree may be 
equivalent. We carefully and efficiently avoid the counting of equivalent queries, 
by using and adapting what is known from the theory of conjunctive database 
queries. This will be discussed in Section 4.5.

4.3 Outer Loop 

In the outer loop we generate all possible trees of increasing sizes and we avoid 
trees that are isomorphic to previously generated ones. In fact, it is well known 
how to do this [57J [501 0H [TO] . What these procedures typically do is generating 
trees that are canonically ordered in the following sense. Given an (unordered) 
tree T, we can order the children of every node in some way, and call this an 
ordering of T. For example, Figure 9 shows two orderings of the same tree.
From the different orderings of a tree T, we want to uniquely select one, to 
be the canonical ordering of T. For each such possible ordering of T, we can 
write down the level sequence of the resulting tree. This is actually a string 
representation of the resulting tree. This level sequence is as follows: if the tree 
has n nodes then this is a sequence of n numbers, where the ith number is the 
depth of the ith node in preorder. Here, the depth of the root is 0, the depth of 
its children is 1, and so on. The canonical ordering of $T$ is then the ordering of
$T$ that yields the lexicographically maximal level sequence among all possible
orderings of $T$.

For example, in Figure 9 the left ordering is the canonical one.

Figure 9: Two orderings of the same tree. The left one is canonical.
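
To make the level sequence and the canonical ordering concrete, here is a minimal sketch. It assumes an ordered rooted tree is encoded as a nested list of children (a leaf is `[]`), and it relies on the standard observation that sorting the children of every node by non-increasing level sequence yields the lexicographically maximal level sequence overall. The helper names are hypothetical; the constant-time generation procedures cited above are more efficient than this sorting approach.

```python
def level_sequence(tree, depth=0):
    """Depths of the nodes in preorder; the root has depth 0."""
    seq = [depth]
    for child in tree:
        seq.extend(level_sequence(child, depth + 1))
    return seq

def canonical_order(tree):
    """Reorder children recursively so that the level sequence becomes maximal."""
    children = [canonical_order(child) for child in tree]
    # Python compares lists lexicographically, and a proper extension of a
    # sequence is larger than the sequence itself, which is what we need here.
    children.sort(key=level_sequence, reverse=True)
    return children
```

For instance, the tree `[[], [[]]]` (a leaf followed by a two-node chain) has level sequence `[0, 1, 1, 2]`, while its canonical ordering `[[[]], []]` has level sequence `[0, 1, 2, 1]`.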

4.4 Inner Loop 

Let $G$ be the data graph being mined, and let $U$ be its set of nodes. In this
section, we fix a tree $T$, and we want to find all instantiated tree patterns $P^\alpha$
based on $T$ whose frequency in $G$ is at least minsup.

This task lends itself naturally to a levelwise approach [3T]. A natural
choice for the specialization relation is suggested by an alternative notation for
the patterns under consideration. Concretely, since the underlying tree $T$ is
fixed, any parameterized tree pattern $P$ based on $T$ is characterized by two
parameters:

1. The set $\Pi$ of existential nodes;

2. The set $\Sigma$ of parameters.

Note that $\Pi$ and $\Sigma$ are disjoint.

Thus, a parameterized tree pattern $P$ is completely characterized by the pair
$(\Pi, \Sigma)$. An instantiation $P^\alpha$ of $P$ is then represented by the triple $(\Pi, \Sigma, \alpha)$. For
two parameterized tree patterns $P_1 = (\Pi_1, \Sigma_1)$ and $P_2 = (\Pi_2, \Sigma_2)$ we now say
that $P_1$ specializes $P_2$ if $\Pi_1 \supseteq \Pi_2$ and $\Sigma_1 \supseteq \Sigma_2$; and $\alpha_2 = \alpha_1|_{\Sigma_2}$. We also say
that $P_2$ generalizes $P_1$.

Parent An immediate generalization of a tree pattern is called a parent.
Formally, let $P = (\Pi, \Sigma)$ and $P' = (\Pi', \Sigma')$ be parameterized tree patterns based
on $T$. We say that $P'$ is a parent of $P$ if:

(i) $\Sigma = \Sigma'$ and $\Pi = \Pi' \cup \{y\}$ for some node $y \notin \Pi'$; or

(ii) $\Pi = \Pi'$ and $\Sigma = \Sigma' \cup \{z\}$ for some node $z \notin \Sigma'$.
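
The parent relation can be spelled out directly from the two conditions above. The sketch below is an illustration only: patterns over a fixed tree are represented as pairs of node-name sets (existential nodes, parameters), and the function names are hypothetical.

```python
def parents(pi, sigma):
    """All parents of the parameterized pattern (pi, sigma).

    A parent is obtained by dropping one existential node (condition (i))
    or one parameter (condition (ii)).
    """
    pi, sigma = frozenset(pi), frozenset(sigma)
    result = [(pi - {y}, sigma) for y in sorted(pi)]
    result += [(pi, sigma - {z}) for z in sorted(sigma)]
    return result

def specializes(p1, p2):
    """True if pattern p1 = (pi1, sigma1) specializes p2 = (pi2, sigma2)."""
    (pi1, sigma1), (pi2, sigma2) = p1, p2
    return set(pi1) >= set(pi2) and set(sigma1) >= set(sigma2)
```

Every pattern specializes each of its parents, and the most general pattern (no existential nodes, no parameters) has no parents.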

From the following lemma, it follows that specialized patterns have a lower
frequency, as expected for a specialization relation:

Lemma 3. Let $P$ and $P'$ be parameterized tree patterns such that $P'$ is a parent
of $P$. Let $P^\alpha$ be an instantiation of $P$, and let $\alpha' = \alpha|_{\Sigma'}$. Then
$\mathrm{Freq}(P^\alpha) \le \mathrm{Freq}(P'^{\alpha'})$.

Proof. We will show that $\#P^\alpha(G) \le \#P'^{\alpha'}(G)$ by defining an injection
$f : P^\alpha(G) \to P'^{\alpha'}(G)$.

Since $P'$ is a parent of $P$, we know that $\Delta' = \Delta \cup \{u\}$ where $u$ is either an
existential node or a parameter of $P$. Note that each $\mu \in P^\alpha(G)$ is of the form
$\bar\mu|_\Delta$ for some matching $\bar\mu$ of $P^\alpha$ in $G$. For each $\mu$ in $P^\alpha(G)$, we fix $\bar\mu$
arbitrarily. Now we define $f(\mu) := \bar\mu|_{\Delta'}$. To see that $f$ is an injection, let
$\mu_1, \mu_2 \in P^\alpha(G)$ and suppose that $f(\mu_1) = f(\mu_2)$. In other words,
$\bar\mu_1|_{\Delta'} = \bar\mu_2|_{\Delta'}$. In particular, $\mu_1 = \bar\mu_1|_\Delta = \bar\mu_2|_\Delta = \mu_2$, as desired.

Hence, we can conclude that $\#P^\alpha(G) \le \#P'^{\alpha'}(G)$ and that
$\mathrm{Freq}(P^\alpha) \le \mathrm{Freq}(P'^{\alpha'})$. □

The above lemma suggests the following definition of specialization among
instantiated tree patterns: we say that $(\Pi_1, \Sigma_1, \alpha_1)$ is a specialization of
$(\Pi_2, \Sigma_2, \alpha_2)$ if the parameterized tree pattern $(\Pi_1, \Sigma_1)$ is a specialization of the
parameterized tree pattern $(\Pi_2, \Sigma_2)$, and $\alpha_2 = \alpha_1|_{\Sigma_2}$.

Intuitively, the previous lemma then expresses that the frequency of an
instantiated tree pattern is always at most the frequency of any of its
instantiated parents.

4.4.1 Candidate generation 

Candidate pattern A candidate pattern is an instantiated tree pattern whose
frequency is not yet determined, but all of whose generalizations are known to be
frequent.

Using the specialization relation and the definition for a candidate pattern 
we explain how the levelwise search for frequent instantiated tree patterns will 
go. 

Levelwise search We start with the most general instantiated tree pattern
$P = (\emptyset, \emptyset, \emptyset)$, and we progressively consider more specific patterns. The search
has the typical property that, in each new iteration, new candidate patterns
are generated; the frequency of all newly discovered candidate patterns is
determined, and the process repeats.

There are many different instantiations to consider for each parameterized 
tree pattern. Hence, to generate candidate patterns in an efficient manner, we 
propose the use of candidacy tables and frequency tables. These candidacy and 
frequency tables allow us to generate all frequent instantiations for a particular 
parameterized tree pattern in parallel. A frequency table contains all frequent 
instantiations for a particular parameterized tree pattern. 

Formally, for any parameterized pattern $P = (\Pi, \Sigma)$, we define:

$\mathit{CanTab}_{\Pi,\Sigma} = \{\alpha \mid P^\alpha \text{ is a candidate instantiated tree pattern}\}$
$\mathit{FreqTab}_{\Pi,\Sigma} = \{\alpha \mid P^\alpha \text{ is a frequent instantiated tree pattern}\}$

Technically, the table has columns for the different parameters, plus a column
freq. Note that when $\Sigma = \emptyset$, i.e., $P$ has no parameters, this is a single-column,
single-row table containing just the frequency of $P$. This still makes sense and
can be interpreted as boolean values; for example, if $\mathit{FreqTab}_{\Pi,\emptyset}$ contains the
empty tuple, then the pattern $(\Pi, \emptyset, \emptyset)$ is frequent; if the table is empty, the
pattern is not frequent. Of course in practice, all frequency tables for
parameterless patterns can be combined into a single table. All frequency tables
are kept in a relational database.

The following crucial lemma shows these tables can be populated efficiently.

Join Lemma. A parameter assignment $\alpha$ is in $\mathit{CanTab}_{\Pi,\Sigma}$ if and only if the
following conditions are satisfied for every parent $(\Pi', \Sigma')$ of $(\Pi, \Sigma)$:

(i) If $\Pi = \Pi'$, then $\alpha|_{\Sigma'} \in \mathit{FreqTab}_{\Pi,\Sigma'}$;

(ii) If $\Sigma = \Sigma'$, then $\alpha \in \mathit{FreqTab}_{\Pi',\Sigma}$.

Proof. For the 'only-if' direction: By definition of a candidacy table, if
$\alpha \in \mathit{CanTab}_{\Pi,\Sigma}$, then all generalizations of $(\Pi, \Sigma, \alpha)$ are frequent. In particular,
for all parents $(\Pi', \Sigma')$ of $(\Pi, \Sigma)$, we know that $(\Pi', \Sigma', \alpha|_{\Sigma'})$ is frequent, since
parents are generalizations.

For the 'if' direction, we must show that all generalizations of $(\Pi, \Sigma, \alpha)$
are frequent. Consider such a generalization $(\Pi_{g_1}, \Sigma_{g_1}, \alpha|_{\Sigma_{g_1}})$. Let us
denote the parent relation by $>_p$. Then there is a sequence of parent patterns:
$(\Pi_{g_1}, \Sigma_{g_1}) >_p (\Pi_{g_2}, \Sigma_{g_2}) >_p \cdots >_p (\Pi', \Sigma')$. And we have:
$\mathrm{Freq}(\Pi_{g_1}, \Sigma_{g_1}, \alpha|_{\Sigma_{g_1}}) \ge \mathrm{Freq}(\Pi_{g_2}, \Sigma_{g_2}, \alpha|_{\Sigma_{g_2}}) \ge \cdots \ge \mathrm{Freq}(\Pi', \Sigma', \alpha|_{\Sigma'}) \ge \mathit{minsup}$.
The last inequality is given by (i) or (ii), the other inequalities are given by
Lemma 3. □

The Join Lemma has its name because, viewing the tables as relational 
database tables, it can be phrased as follows: 

Each candidacy table can be computed by taking the natural join of 
its parent frequency tables. 

The only exception is when $\Pi = \emptyset$ and $\Sigma = \{z\}$ is a singleton; this is the
initial iteration of the search process, when there are no constants in the parent
tables to start from. In that case, we define $\mathit{CanTab}_{\emptyset,\{z\}}$ as the table with a
single column $z$, holding all nodes of the data graph $G$ being mined.
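
The relational reading of the Join Lemma can be illustrated outside a database as well. The sketch below assumes each frequency table is given as a list of rows, each row a dict from parameter name to data constant (with the freq column already dropped); `natural_join` and `candidacy_table` are hypothetical helper names.

```python
def natural_join(left, right):
    """Natural join of two tables, each a list of dicts mapping column -> value."""
    joined = []
    for a in left:
        for b in right:
            # Rows are compatible when they agree on all shared columns.
            if all(a[c] == b[c] for c in a.keys() & b.keys()):
                row = {**a, **b}
                if row not in joined:
                    joined.append(row)
    return joined

def candidacy_table(parent_freq_tables):
    """Join the frequency tables of all parents of a pattern."""
    tables = list(parent_freq_tables)
    result = tables[0]
    for table in tables[1:]:
        result = natural_join(result, table)
    return result
```

A parameterless parent fits the same scheme: its table is `[{}]` (the empty tuple) when the pattern is frequent, which leaves the join unchanged, and `[]` when it is not, which empties the candidacy table.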

4.4.2 Frequency counting using SQL 

The search process starts by determining the frequency of the underlying tree
$T = (\emptyset, \emptyset)$; indeed, formally this amounts to computing $\mathit{FreqTab}_{\emptyset,\emptyset}$. Similarly,
for each parameterized tree pattern $P = (\Pi, \emptyset)$ with $\Pi \ne \emptyset$, all we can do is
determine its frequency, except that here, we do this only on condition that its
parent patterns are frequent.

We have seen above that, if the frequency tables are viewed as relational 
database tables, we can compute each candidacy table by a single database 
query, using the Join Lemma. Now suppose the data graph G that is being 
mined is stored in the relational database system as well, in the form of a table 
G(from,to). Then also each frequency table can be computed by a single SQL 
query. 

Indeed, in the cases where $\Sigma = \emptyset$, this simply amounts to formulating the
pattern in SQL, and determining its count (eliminating duplicates). Our
patterns are in fact conjunctive queries (or datalog rules) known from database
research [2"1 140], so they can easily be translated into SQL:

• The FROM-clause consists of all table references of the form G as Gij, for
all edges $x_i \to x_j$ in $T$.

• The WHERE-clause consists of all equalities of the form Gij.from =
Gik.from, as well as equalities of the form Gij.to = Gjh.from.

• The SELECT-clause is of the form SELECT DISTINCT and consists of all
column references of the form Gij.to when $x_j$ is a distinguished node in $P$,
plus one reference of the form G1k.from if the root node is distinguished.

Figure 10: Illustration of translating a tree pattern without parameters into SQL.

The SQL query for the tree in Figure 10 with $\Pi = \{x_2\}$ and $\Sigma = \emptyset$ is as follows:

E = SELECT DISTINCT G12.from AS x1, G23.to AS x3, G24.to AS x4
    FROM G AS G12, G AS G23, G AS G24
    WHERE G12.to = G23.from AND G12.to = G24.from
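
As a rough illustration of the translation rules, the following sketch generates such a query for an arbitrary tree pattern without parameters and runs it on SQLite. Several things here are assumptions made for the example: the edge table uses columns `src` and `dst` instead of from and to (which are SQL keywords), aliases are written `Gi_j`, the pattern has at least one edge and at least one distinguished node, and `load_graph` and `pattern_sql` are hypothetical names.

```python
import sqlite3

def load_graph(edge_list):
    """Store the data graph in an in-memory table G(src, dst)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE G (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO G VALUES (?, ?)", edge_list)
    return conn

def pattern_sql(edges, existential, root=1):
    """Translate a parameterless tree pattern into a SELECT DISTINCT query.

    edges: list of (i, j) pairs, one per tree edge x_i -> x_j.
    existential: set of node numbers that are not selected.
    """
    alias = {(i, j): f"G{i}_{j}" for i, j in edges}
    from_clause = ", ".join(f"G AS {a}" for a in alias.values())
    where = []
    first_out = {}  # first edge leaving each node; siblings are tied to it
    for (i, j), a in alias.items():
        if i in first_out:
            where.append(f"{first_out[i]}.src = {a}.src")
        else:
            first_out[i] = a
    for (i, j), a in alias.items():
        if j in first_out:  # x_j has children: link this edge to them
            where.append(f"{a}.dst = {first_out[j]}.src")
    select = []
    if root not in existential:
        select.append(f"{first_out[root]}.src AS x{root}")
    for (i, j), a in alias.items():
        if j not in existential:
            select.append(f"{a}.dst AS x{j}")
    sql = f"SELECT DISTINCT {', '.join(select)} FROM {from_clause}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql
```

For the tree with edges (1, 2), (2, 3), (2, 4) and node 2 existential, `pattern_sql` produces a query equivalent to the expression E above; the frequency of the pattern is the number of rows it returns.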

But also when $\Sigma \ne \emptyset$, we can compute $\mathit{FreqTab}_{\Pi,\Sigma}$ by a single SQL query.
Note that we thus compute the frequency of a large number of instantiated tree
patterns in parallel! We proceed as follows:

1. We formulate the pattern $(\Pi, \emptyset)$ in SQL; call the resulting expression $E$.

2. We then take the natural join of $E$ and $\mathit{CanTab}_{\Pi,\Sigma}$, group by $\Sigma$, and count
each group.

The join with the candidacy table ensures that only candidate patterns are 
counted. 

For instance, the SQL query to compute the frequency table for the tree in
Figure 10 with $\Pi = \{x_2\}$ and $\Sigma = \{x_1, x_3\}$, with E as above, is as follows:

SELECT E.x1, E.x3, COUNT(*)
FROM E, CanTab_{x2},{x1,x3} AS CT
WHERE E.x1 = CT.x1 AND E.x3 = CT.x3
GROUP BY E.x1, E.x3 HAVING COUNT(*) >= minsup

It goes without saying that, whenever the frequency table of a tree pattern 
is found to be empty, the search for more specialized patterns is pruned at that 
point. 

4.4.3 The algorithm 

Putting everything together so far, the algorithm is given in Algorithm 1. In
outline it is a double Apriori algorithm [3], where the sets $\Pi$ form one dimension
of itemsets, and the sets $\Sigma$ another. A graphical illustration of the algorithm is
given in Figure 11. In this illustration we use tries (or prefix-trees) to store the
itemsets. A trie [SJ UJ 126) is commonly used in implementations of the Apriori 
algorithm. 
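Algorithm 1 Levelwise search for frequent tree patterns.

 1: for each unordered, rooted tree $T$ do
 2:   $X$ := set of nodes of $T$
 3:   $p$ := 0; $\mathcal{P}_0$ := $\{\emptyset\}$
 4:   repeat
 5:     for each $\Pi \in \mathcal{P}_p$ do
 6:       Compute $\mathit{FreqTab}_{\Pi,\emptyset}$ in SQL
 7:       if $\mathit{FreqTab}_{\Pi,\emptyset} \ne \emptyset$ then
 8:         $s$ := 1
 9:         $\mathcal{S}_1$ := $\{\{z\} \mid z \in X - \Pi\}$
10:         repeat
11:           for each $\Sigma \in \mathcal{S}_s$ do
12:             if $p = 0$ and $s = 1$ then
13:               $\mathit{CanTab}_{\Pi,\Sigma}$ := set of nodes of $G$
14:             else
15:               $\mathit{CanTab}_{\Pi,\Sigma}$ := $\bowtie \{\mathit{FreqTab}_{\Pi',\Sigma'} \mid (\Pi', \Sigma')$ parent of $(\Pi, \Sigma)\}$
16:             end if
17:             Compute $\mathit{FreqTab}_{\Pi,\Sigma}$ in SQL
18:             if $\mathit{FreqTab}_{\Pi,\Sigma} = \emptyset$ then
19:               remove $\Sigma$ from $\mathcal{S}_s$ {$\Sigma$ is pruned away}
20:             end if
21:           end for
22:           $\mathcal{S}_{s+1}$ := $\{\Sigma \subseteq X - \Pi \mid \#\Sigma = s + 1$
23:                       and each $s$-subset of $\Sigma$ is in $\mathcal{S}_s\}$
24:           $s$ := $s + 1$
25:         until $\mathcal{S}_s = \emptyset$
26:       else
27:         remove $\Pi$ from $\mathcal{P}_p$ {$\Pi$ is pruned away}
28:       end if
29:     end for
30:     $\mathcal{P}_{p+1}$ := $\{\Pi \subseteq X \mid \#\Pi = p + 1$ and each $p$-subset of $\Pi$ is in $\mathcal{P}_p\}$
31:     $p$ := $p + 1$
32:   until $\mathcal{P}_p = \emptyset$
33: end for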
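
Lines 22-23 and line 30 of Algorithm 1 are the usual Apriori candidate generation step: a set of size k + 1 is kept only if all of its k-subsets survived the previous level. A minimal sketch, with the hypothetical name `next_level` and sets of node names as items:

```python
from itertools import combinations

def next_level(universe, current_level, size):
    """All (size + 1)-subsets of `universe` whose every size-subset is in current_level."""
    current = {frozenset(s) for s in current_level}
    result = []
    for cand in combinations(sorted(universe), size + 1):
        if all(frozenset(sub) in current for sub in combinations(cand, size)):
            result.append(frozenset(cand))
    return result
```

Starting from the level containing only the empty set (size 0) yields all singletons; if a singleton is then pruned, no larger set containing it is ever generated.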
4.4.4 Example run

In this section we give an example run of the proposed algorithm in Algorithm 1.
Consider the example data graph $G$ in Figure 12(a), the unordered rooted tree
$T$ in Figure 12(b), and let the minimum support threshold be 3.
The example run then looks as follows:
4.5 Equivalence among Tree Patterns

In this section we make a number of modifications to the algorithm described
so far, so as to avoid duplicate work.

As an example of duplicate work, consider the parameterized tree pattern
$P_1$ from the example run in Section 4.4.4 ($\Pi = \{x_2\}$ and $\Sigma = \emptyset$):

      x1
     /  \
    ∃    x3

and the parameterized tree pattern $P_2$:

    [tree diagram of $P_2$, rooted at $x_1$]

Clearly, $P_1$ and $P_2$ have the same answer set for all data graphs $G$, up to
renaming of the distinguished variables ($x_2$ by $x_3$). However, these patterns
have different underlying trees, and hence Algorithm 1 will compute the answer
set for both patterns (line 6). The answer set of $P_1$ is computed before the
answer set of $P_2$, since our algorithm is incremental in the number of nodes of
$T$. Hence, we can conclude that our algorithm performs some duplicate work
which we want to avoid.

Another example of duplicate work our algorithm performs: Consider the
parameterized tree pattern $P_3$ from the example run in Section 4.4.4
($\Pi = \{x_1\}$ and $\Sigma = \{x_2\}$):

      ∃
     / \
   σ2   x3

and the parameterized tree pattern $P_4$, also from the example run in
Section 4.4.4 ($\Pi = \{x_1\}$ and $\Sigma = \{x_3\}$):

      ∃
     / \
   x2   σ3

As one can see in Section 4.4.4, these two parameterized patterns have the same
instantiations for all data graphs $G$, up to renaming of the parameters ($\sigma_2$ by
$\sigma_3$), and for each instantiation, the same answer set for all data graphs $G$, up
to renaming of the distinguished variables ($x_3$ by $x_2$). However, when we look
at the outline of our algorithm in Algorithm 1, we see that for both patterns
the candidacy and frequency tables are computed between line 11 and line 17.
Hence, we can conclude again that our algorithm performs duplicate work that
we want to avoid.

In the rest of this section we formalize the duplicate work our algorithm
performs, and we make a number of modifications to the algorithm described
so far, so as to avoid the duplicate work.

4.5.1 Equivalency 

Intuitively we call two parameterized tree patterns equivalent if they have the
same answer sets and the same parameter assignments for all data graphs $G$,
up to renaming of the parameters and the distinguished variables. For instance,
we call the parameterized tree patterns $P_1$ and $P_2$ from above equivalent, as
well as the tree patterns $P_3$ and $P_4$ from above.

To define equivalent parameterized tree patterns formally we introduce the
notion of $(\delta, \rho)$-equivalence.

$(\delta, \rho)$-Equivalence Let $P_1$ and $P_2$ be two parameterized tree patterns and
$\rho$ a parameter correspondence from $P_1$ to $P_2$ (recall Section 3.3). We define
an answer set correspondence from $P_1$ to $P_2$ as any mapping $\delta : \Delta_1 \to \Delta_2$.
Furthermore, assume that $\delta$ and $\rho$ are bijections. We then say that $P_1$ and
$P_2$ are $(\delta, \rho)$-equivalent, denoted by $P_1 \equiv_{\delta,\rho} P_2$, if for all data graphs $G$, and all
parameter assignments $\alpha_2 : \Sigma_2 \to U$, we have $P_2^{\alpha_2}(G) \circ \delta = P_1^{\alpha_2 \circ \rho}(G)$, where
$P_2^{\alpha_2}(G) \circ \delta$ denotes the set $\{f \circ \delta : f \in P_2^{\alpha_2}(G)\}$.

For example, consider the two parameterized tree patterns in Figure 13 and
let $\rho : \Sigma_a \to \Sigma_b$ be as follows:

    $\sigma_1 \mapsto \sigma_1$,  $\sigma_2 \mapsto \sigma_3$,  $\sigma_3 \mapsto \sigma_2$

and let $\delta : \Delta_a \to \Delta_b$ be as follows:

    $x_1 \mapsto x_3$,  $x_2 \mapsto x_1$,  $x_3 \mapsto x_2$

The two parameterized tree patterns are clearly $(\delta, \rho)$-equivalent, as are the
three parameterized tree patterns shown in Figure 14 with an empty parameter
correspondence $\rho$ and $\delta$ the identity.

The parameter correspondence $\rho$ is a bijection in the definition of
$(\delta, \rho)$-equivalence, since intuitively we want equivalent parameterized tree
patterns to have essentially the same set of instantiations. Hence it is
necessary that the tree patterns have the same number of parameters.
Intuitively we also want equivalent tree patterns to have the same answer sets
up to renaming of the distinguished variables. That is the reason why an answer
set correspondence is introduced that is a bijection.

        σ1                  σ1
       /  \                /  \
      ∃    x2            x1    ∃
     / \   / \          / \   / \
    σ2 x1 σ3 x3        x2 σ2 x3 σ3

        (a)                 (b)

Figure 13: Two equivalent parameterized tree patterns.

Figure 14: Three equivalent parameterized tree patterns, panels (a), (b) and (c).

We then define equivalent parameterized tree patterns as follows: 



Equivalent parameterized tree patterns We call two parameterized tree
patterns $P_1$ and $P_2$ equivalent if $P_1$ is $(\delta, \rho)$-equivalent with $P_2$ for some bijective
parameter correspondence $\rho$ and some bijective answer set correspondence $\delta$.

Note that there can exist more than one parameter correspondence $\rho$ and
more than one answer set correspondence $\delta$ for which the two parameterized
tree patterns are $(\delta, \rho)$-equivalent. An illustration of this is given in Figure 15.
Let $\rho_1 : \Sigma_a \to \Sigma_b$, $\delta_1 : \Delta_a \to \Delta_b$, $\rho_2 : \Sigma_a \to \Sigma_b$ and $\delta_2 : \Delta_a \to \Delta_b$ be as follows:
$\rho_1$ is the identity; $\delta_1$ is the identity;

    $\rho_2$:  $\sigma_1 \mapsto \sigma_2$,  $\sigma_2 \mapsto \sigma_1$

    $\delta_2$:  $x_1 \mapsto x_1$,  $x_2 \mapsto x_4$,  $x_3 \mapsto x_5$,  $x_4 \mapsto x_2$,  $x_5 \mapsto x_3$

Then the two tree patterns in Figure 15 are clearly $(\delta_1, \rho_1)$-equivalent and
$(\delta_2, \rho_2)$-equivalent.

Equivalence as just defined is a semantical property, referring to all possible 
data graphs, and it is not immediately clear how one could decide this property 
syntactically. The required syntactical notion is given by the following Lemma 
and Corollary. 

Lemma 4. Consider two parameterized tree patterns $P_1$ and $P_2$, $\delta : \Delta_1 \to \Delta_2$
a bijective answer set correspondence, and $\rho : \Sigma_1 \to \Sigma_2$ a bijective parameter
correspondence. Then $P_1 \equiv_{\delta,\rho} P_2$ if and only if we have the following containment
relations among the tree queries $(H_1, P_1)$ and $(\delta(H_1), P_2)$, with $H_1$ the pure head
of $P_1$ (cf. Section 4.1):

1. $(\delta(H_1), P_2) \subseteq_\rho (H_1, P_1)$; and

2. $(H_1, P_1) \subseteq_{\rho^{-1}} (\delta(H_1), P_2)$.

          x1
         /  \
       x2    x4
      / \    / \
    σ1  x3  σ2  x5

Figure 15: Two equivalent parameterized tree patterns with more than one
possibility for a parameter and answer set correspondence.

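Proof. Let us start with the if direction. We need to prove that for every
parameter assignment $\alpha_2$ for $P_2$, and every data graph $G$, we have
$P_2^{\alpha_2}(G) \circ \delta = P_1^{\alpha_2 \circ \rho}(G)$.
We know that $(\delta(H_1), P_2)^{\alpha_2}(G) \subseteq (H_1, P_1)^{\alpha_2 \circ \rho}(G)$ since $(\delta(H_1), P_2) \subseteq_\rho (H_1, P_1)$.
We may rewrite this as: $P_2^{\alpha_2}(G) \circ \delta \subseteq P_1^{\alpha_2 \circ \rho}(G)$ since $H_1$ is an enumeration of
$\Delta_1$.

We also know that $(H_1, P_1)^{\alpha_1}(G) \subseteq (\delta(H_1), P_2)^{\alpha_1 \circ \rho^{-1}}(G)$ for every
parameter assignment $\alpha_1$ for $P_1$ since $(H_1, P_1) \subseteq_{\rho^{-1}} (\delta(H_1), P_2)$. Now take
$\alpha_1 = \alpha_2 \circ \rho$. We then have $(H_1, P_1)^{\alpha_2 \circ \rho}(G) \subseteq (\delta(H_1), P_2)^{\alpha_2}(G)$. Again since
$H_1$ is an enumeration of $\Delta_1$ we may rewrite this as:
$P_1^{\alpha_2 \circ \rho}(G) \subseteq P_2^{\alpha_2}(G) \circ \delta$. Hence we can conclude that
$P_2^{\alpha_2}(G) \circ \delta = P_1^{\alpha_2 \circ \rho}(G)$.

Let us then look at the only-if direction. To prove that $(\delta(H_1), P_2) \subseteq_\rho
(H_1, P_1)$, we will show that for every parameter assignment $\alpha_2$ for $P_2$, and every
data graph $G$, we have $(\delta(H_1), P_2)^{\alpha_2}(G) \subseteq (H_1, P_1)^{\alpha_2 \circ \rho}(G)$. Since
$P_2^{\alpha_2}(G) \circ \delta = P_1^{\alpha_2 \circ \rho}(G)$, we have $(\delta(H_1), P_2)^{\alpha_2}(G) = (H_1, P_1)^{\alpha_2 \circ \rho}(G)$, and
hence clearly $(\delta(H_1), P_2)^{\alpha_2}(G) \subseteq (H_1, P_1)^{\alpha_2 \circ \rho}(G)$.

To prove that $(H_1, P_1) \subseteq_{\rho^{-1}} (\delta(H_1), P_2)$, we will show that for every
parameter assignment $\alpha_1$ for $P_1$, and every data graph $G$, we have
$(H_1, P_1)^{\alpha_1}(G) \subseteq (\delta(H_1), P_2)^{\alpha_1 \circ \rho^{-1}}(G)$. We know that for every parameter
assignment $\alpha_2$ for $P_2$, we have $P_2^{\alpha_2}(G) \circ \delta = P_1^{\alpha_2 \circ \rho}(G)$. Now take
$\alpha_2 = \alpha_1 \circ \rho^{-1}$. We then have $P_2^{\alpha_1 \circ \rho^{-1}}(G) \circ \delta = P_1^{\alpha_1}(G)$, hence
$(\delta(H_1), P_2)^{\alpha_1 \circ \rho^{-1}}(G) = (H_1, P_1)^{\alpha_1}(G)$, hence
clearly $(H_1, P_1)^{\alpha_1}(G) \subseteq (\delta(H_1), P_2)^{\alpha_1 \circ \rho^{-1}}(G)$. □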

Corollary 1. Consider two parameterized tree patterns $P_1$ and $P_2$. Then $P_1$ is
equivalent with $P_2$ if and only if there exist:

1. a bijective answer set correspondence $\delta : \Delta_1 \to \Delta_2$;

2. a bijective parameter correspondence $\rho : \Sigma_1 \to \Sigma_2$;

3. a $\rho$-containment mapping $f_1 : (H_1, P_1) \to (\delta(H_1), P_2)$; and

4. a $\rho^{-1}$-containment mapping $f_2 : (\delta(H_1), P_2) \to (H_1, P_1)$,

with $H_1$ the pure head for $P_1$.
Of course, we want to avoid that our algorithm considers some parameterized
tree pattern $P_2$ if it is equivalent to an earlier considered parameterized tree
pattern $P_1$. Since our algorithm generates trees in increasing sizes, there are
two cases to consider:

Case A: $P_1$ has fewer nodes than $P_2$.

Case B: $P_1$ and $P_2$ have the same number of nodes.

Armed with the above Lemma and Corollary, we can now analyze the above 
two cases. 

4.5.2 Case A: Redundancy checking 

Let us start by defining the notion of a redundancy. 

Redundant subtree A redundant subtree R, is a subtree of a parameterized 
tree pattern P, such that removing R from P yields a parameterized tree pattern 
P' that is equivalent with P. 

For example, the first two parameterized tree patterns in Figure 14 indeed
contain a redundant subtree.

The following lemma shows that two parameterized tree patterns with
different numbers of nodes can only be equivalent if the larger one contains
redundant subtrees:

Lemma 5. Consider two parameterized tree patterns $P$ and $P'$ such that the
number of nodes of $P'$ is smaller than the number of nodes of $P$. Then $P$ can
only be equivalent with $P'$ if $P$ contains redundant subtrees.

Proof. Since $P$ and $P'$ are equivalent we know from Corollary 1 that the
following exist:

1. an answer set correspondence $\delta : \Delta \to \Delta'$ that is a bijection;

2. a parameter correspondence $\rho : \Sigma \to \Sigma'$ that is a bijection;

3. a $\rho$-containment mapping $f_1 : (H, P) \to (\delta(H), P')$; and

4. a $\rho^{-1}$-containment mapping $f_2 : (\delta(H), P') \to (H, P)$,

with $H$ the pure head for $P$. Since the number of nodes of $P'$ is smaller than
the number of nodes of $P$, we know that some subtrees $R$ of $P$ are not in the
range of $f_2$. We will prove that these subtrees $R$ are redundant subtrees, by
showing that $P$ and $P - R$ are equivalent.

Since the containment mappings $f_1$ and $f_2$ exist, we know that in particular
the following containment mappings will exist:

1. $g_1 = f_1|_{P-R}$, a $\rho$-containment mapping from $(H, P - R)$ to $(\delta(H), P')$,
and

2. $g_2 = f_2$, a $\rho^{-1}$-containment mapping from $(\delta(H), P')$ to $(H, P - R)$,

with $\delta$ and $\rho$ as above.

Let us now look at the following mappings: $h_1 = g_2 \circ f_1$ and $h_2 = f_2 \circ g_1$.
By Lemma 1, $h_1 = g_2 \circ f_1$ and $h_2 = f_2 \circ g_1$ are identity-containment mappings.

Using Corollary 1 we can now conclude that $P$ and $P - R$ are (identity,
identity)-equivalent and thus $R$ is a redundant subtree. □

Figure 16: A tree pattern that contains a linear chain of existential nodes that
is redundant.

From this lemma it follows that Case A can only happen if $P_2$ contains
redundant subtrees. Hence, if we can avoid redundancies, Case A will never occur.
The following lemma provides us with an efficient check for redundancies.

Redundancy Lemma. Let $P$ be a parameterized tree pattern. Then $P$ has
a redundancy if and only if it contains a subtree $C$ in the form of a linear chain
of existential nodes (possibly just a single node), such that the parent of $C$ has
another subtree that is at least as deep as $C$.
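
The syntactic test of the Redundancy Lemma is simple enough to sketch directly. The code below is an illustration under assumptions: a pattern is given by a dict `children` mapping each node to the list of its children, a set `existential` of existential nodes, and a root; the depth of a subtree is measured as the number of nodes on its longest downward path; all function names are hypothetical.

```python
def height(children, node):
    """Number of nodes on the longest downward path starting at `node`."""
    return 1 + max((height(children, c) for c in children.get(node, [])), default=0)

def is_existential_chain(children, existential, node):
    """True if the subtree rooted at `node` is a linear chain of existential nodes."""
    while True:
        if node not in existential:
            return False
        kids = children.get(node, [])
        if len(kids) > 1:
            return False
        if not kids:
            return True
        node = kids[0]

def has_redundancy(children, existential, root):
    """Check the condition of the Redundancy Lemma on every node of the pattern."""
    stack = [root]
    while stack:
        parent = stack.pop()
        kids = children.get(parent, [])
        for c in kids:
            if is_existential_chain(children, existential, c):
                depth_c = height(children, c)
                # Look for a sibling subtree that is at least as deep as the chain.
                if any(height(children, o) >= depth_c for o in kids if o != c):
                    return True
        stack.extend(kids)
    return False
```

For example, a root with one existential leaf child and one distinguished leaf child is reported as redundant, whereas a chain root, existential node, distinguished leaf is not.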

Before we prove this Lemma, let us see some examples. For instance the
parameterized tree patterns in Figure 14(a) and Figure 14(b) contain a linear
chain of existential nodes that is redundant. In both tree patterns this linear
chain is rooted in $\exists_1$. Another example of such a redundancy is given in
Figure 16. Here the linear chain is rooted in $\exists_3$. Note that when we remove
the linear chain rooted in $\exists_3$, we have a new linear chain rooted in $\exists_1$ that is
redundant.

Proof. Let us refer to a subtree $C$ as described in the lemma as an "eliminable
path". An eliminable path is clearly redundant, so we only need to prove the
only-if direction. Let $T$ be a redundant subtree of $P$ that is maximal, in the
sense that it is not the subtree of another redundant subtree. Then following
Corollary 1, there must be a $\rho$-containment mapping $h$ from $(H, P)$ to
$(\delta(H), P - T)$ with $\rho$ and $\delta$ bijections and $H$ the pure head for $P$. All
distinguished variables of $P$ must be in $P - T$, since $\delta$ is a bijection. Also all
parameters of $P$ must be in $P - T$, since $\rho$ is also a bijection. So $T$ consists
entirely of existential nodes.

Furthermore, note that $h$ must fix the root of $P$, since the height of $P$ is at
least that of $P - T$.

Any iteration $h^n$ of $h$ is a $\rho^n$-containment mapping from $(H, P)$ to
$(\delta^n(H), P - T)$ by Lemma 1. Moreover, each $h^n|_{\Delta \cup \Sigma}$ induces a permutation on
the set $\Delta \cup \Sigma$ of distinguished variables and parameters. Since $\Delta \cup \Sigma$ is finite,
there are only a finite number of possible permutations of $\Delta \cup \Sigma$, namely
$|\Delta \cup \Sigma|!$. Hence, there will be an iteration $h^k|_{\Delta \cup \Sigma}$ and an iteration
$h^{k+l}|_{\Delta \cup \Sigma}$ such that $h^k|_{\Delta \cup \Sigma} = h^{k+l}|_{\Delta \cup \Sigma}$. Hence, $h^l|_{\Delta \cup \Sigma}$ is the identity on
$\Delta \cup \Sigma$, because

$\mathrm{id}|_{\Delta \cup \Sigma} = (h^k|_{\Delta \cup \Sigma})^{-1} \circ h^k|_{\Delta \cup \Sigma}
 = (h^k|_{\Delta \cup \Sigma})^{-1} \circ h^{k+l}|_{\Delta \cup \Sigma}
 = h^l|_{\Delta \cup \Sigma}$.

There are now two possible cases. 



1. T itself is a linear chain. Let us then look at the parent p of T in P. Again 
there are two possibilities: 

(a) h (p) = p: Since h is a p'-containment mapping from (if, P) to 
(<5 ( (if), P — T) and T is redundant, we know that T must be mapped 
to another subtree of p, T", that is at least as deep as T. Hence, T 



is an eliminable path. An illustration is given in Figure 17(a) 

(b) h l (p) ^ p: Then p can only be an existential node. We now have two 
possibilities: 

i. T is the only subtree of p. We will show that the subtree T", 
rooted in p is redundant as well. Clearly we have the following 
containment relations: 

• h\ = h, a p -containment mapping from (H, P) to (5 l (H), P— 
T'); and 

• hi — h , a p -containment mapping from (S l (H),P — T') 
to (H,P). 

with δ and ρ as above. By Corollary 1, T' is then a redundant
subtree. This is in contradiction with the assumption that T is
maximal. Hence, it is impossible that p has only one subtree and
p is existential. An illustration of this is given in Figure 17(b).

ii. p has more than one subtree. Consider another such subtree
T'. We will show that all subtrees T' of p consist entirely of
existential nodes. Suppose a node n ∈ T' is not an existential
node. We then know that h^l(n) = n. However, since h^l is a
homomorphism and p is an ancestor of n, h^l(p) must be p. But
this is in contradiction with the assumption that h^l(p) ≠ p. So
T' must consist entirely of existential nodes. Hence this brings us
to the second case, where T is not a linear chain. An illustration
is given in Figure 17(c).



2. T is not a linear chain. An easy induction on the height of T shows that
any non-linear tree consisting entirely of existential nodes must contain
an eliminable path. If the height of T is 1, there is an eliminable path of a
single node: just choose one of the children of the root. If the height of T
is n > 1, consider the subtree S of the root of T with the smallest height,
which is at most n − 1. Then we have two possibilities: if S is a linear chain,
we have found our eliminable path; and if S is not a linear chain, we know by
induction that S contains an eliminable path. Hence, T, and thus also
P, contains an eliminable path as desired.

□

Figure 17: Figures to illustrate the proof of the Redundancy Lemma.

As we have seen in Section 4.4, our algorithm introduces existential nodes
levelwise, one by one. This makes the redundancy test provided by the
Redundancy Lemma particularly easy to perform. Indeed, if (Π, Σ) is a parameterized
tree pattern of which we already know that it has no redundancies, and we make
one additional node n existential, then it suffices to test whether n thus becomes
part of a subtree C as in the Redundancy Lemma. If so, we will prune the entire
search at Π ∪ {n}.

4.5.3 Case B: Canonical forms 

We may now assume that P_1 and P_2 do not contain redundancies, for if they
did, they would have been dismissed already.

Let us start by defining isomorphic parameterized tree patterns.

Isomorphic Parameterized Tree Patterns We call two parameterized tree
patterns P_1 and P_2 isomorphic if there exists a homomorphism f : P_1 → P_2
that is a bijection and that maps distinguished nodes to distinguished nodes,
parameters to parameters and existential nodes to existential nodes. We call f
an isomorphism. Since we are working with trees, f^{-1} is also a homomorphism.

For example, the two parameterized tree patterns in Figure 13 are indeed
isomorphic with f as follows:

[Table: the isomorphism f, listing for each node of the first pattern
(parameters σ_i and variables x_i) its image in the second pattern.]


Clearly, we have the following: 

Property 1. Any two isomorphic parameterized tree patterns P_1 and P_2 are
equivalent.

Proof. Using Corollary 1, we have to show that the following exist:

1. a bijective answer set correspondence δ : Δ_1 → Δ_2;

2. a bijective parameter correspondence ρ : Σ_1 → Σ_2;

3. a ρ-containment mapping f_1 : (H_1, P_1) → (δ(H_1), P_2); and

4. a ρ^{-1}-containment mapping f_2 : (δ(H_1), P_2) → (H_1, P_1),

with H_1 the pure head for P_1.

Since P_1 and P_2 are isomorphic, there exists a homomorphism f : P_1 → P_2
that is a bijection and that maps distinguished nodes to distinguished nodes,
parameters to parameters and existential nodes to existential nodes.

We now take δ = f|_{Δ_1} and ρ = f|_{Σ_1}. Then δ and ρ are bijections since f is
a bijection.

For (3) we will show that f is a ρ-containment mapping from (H_1, P_1) to
(δ(H_1), P_2), with H_1 the pure head for P_1:

• f is a homomorphism;

• f maps distinguished nodes to distinguished nodes and f|_{Δ_1} = δ;

• f maps parameters to parameters and f|_{Σ_1} = ρ; and

• f(H_1) = δ(H_1).

For (4) we will show that f^{-1} is a ρ^{-1}-containment mapping from (δ(H_1), P_2)
to (H_1, P_1), with H_1 the pure head for P_1:

• f^{-1} is a homomorphism since f is a bijection and we are working with
trees;

• f^{-1} maps distinguished nodes to distinguished nodes and f^{-1}|_{Δ_2} = δ^{-1};

• f^{-1} maps parameters to parameters and f^{-1}|_{Σ_2} = ρ^{-1}; and

• f^{-1}(δ(H_1)) = H_1.

□ 

The following lemma shows that two parameterized tree patterns without 
redundancies and with the same number of nodes can only be equivalent if they 
are isomorphic. 

Isomorphism Lemma. Consider two parameterized tree patterns P_1 and P_2
without redundancies, and with the same number of nodes. Then P_1 and P_2 are
equivalent if and only if P_1 and P_2 are isomorphic.

Proof. We only need to show the only-if direction.

Since P_1 and P_2 are equivalent, we know by Corollary 1 that the following
exist:

1. a bijective answer set correspondence δ : Δ_1 → Δ_2;

2. a bijective parameter correspondence ρ : Σ_1 → Σ_2;

3. a ρ-containment mapping f_1 : (H_1, P_1) → (δ(H_1), P_2); and

4. a ρ^{-1}-containment mapping f_2 : (δ(H_1), P_2) → (H_1, P_1),

with H_1 a pure head for P_1.

We also know that P_1 and P_2 have the same number of existential nodes,
since P_1 and P_2 have the same number of nodes and ρ and δ are bijections.

Hence, to prove that P_1 and P_2 are isomorphic, we only need to show that:

1. f_1 maps existential nodes to existential nodes; and

2. f_1|_{Π_1} is a bijection.

To this end, it suffices to prove that f_1 is surjective on the existential nodes of
P_2, because f_1 is already a bijection from Δ_1 ∪ Σ_1 to Δ_2 ∪ Σ_2.

Assume that f_1|_{Π_1} is not surjective. Hence, there will be some existential
nodes p ∈ Π_2 that are not in the range of f_1. Note that these existential nodes p
can never have descendants that are parameters or distinguished nodes, since f_1
is a homomorphism and δ and ρ are bijections. Now fix some such p as high as
possible in the tree. Then the entire subtree R rooted in p consists entirely of
existential nodes that are not in the range of f_1. We will now show that this
subtree R is a redundant subtree in P_2. We then have a contradiction, since we
assume that P_2 is redundancy free.

Since the containment mappings f_1 and f_2 exist, we know that in particular
the following containment mappings will exist:

1. g_1 = f_1, a ρ-containment mapping from (H_1, P_1) to (δ(H_1), P_2 − R); and

2. g_2 = f_2|_{P_2 − R}, a ρ^{-1}-containment mapping from (δ(H_1), P_2 − R) to
(H_1, P_1),

with δ and ρ as above.

Let us now look at the following mappings: h_1 = f_1 ∘ g_2 and h_2 = g_1 ∘ f_2. By
Lemma 1, h_1 and h_2 are identity-containment mappings. Using Corollary 1, we
can now conclude that P_2 and P_2 − R are equivalent, and thus R is a redundant
subtree. □

From the above lemma it follows that Case B can only happen if P_1 and
P_2 are actually isomorphic. In particular, P_1 and P_2 have the same underlying
tree.

So, in our algorithm, we need an efficient way to avoid isomorphic
parameterized tree patterns based on the same tree T.

Fortunately, there is a standard way to do this, by working with canonical
forms of parameterized tree patterns. Consider a pair (Π, Σ), as in Section 4.4.
We can view this pair as a labeling of T: all nodes in Π get the same generic
label '∃'; all nodes in Σ get 'σ'; and all distinguished nodes get 'x'. We then
observe that the patterns (Π_1, Σ_1) and (Π_2, Σ_2) are isomorphic iff there is a tree
isomorphism between the corresponding labeled versions of T that respects the
labels.

In order to represent each pair (Π, Σ) uniquely up to isomorphism, we can
rather straightforwardly refine the canonical ordering of the underlying
unlabeled tree T, which we already have (Section 4.3), to take into account the
node labels. Furthermore, the classical linear-time algorithm to canonize a tree [3]
generalizes straightforwardly to labeled trees. A nice review of these
generalizations has been given by Chi, Yang and Muntz [10].

We will omit the details of the canonical form; in fact, there are several ways 
to realize it. All that is important is that we can check in linear time whether 
a pair is canonical; that a pair can be canonized in linear time; and that two 
pairs are isomorphic if and only if their canonical forms are identical. 

Example. We can refine the level sequence introduced in Section 4.3 to a refined
level sequence for parameterized tree patterns P as follows: if the tree pattern
P consists of n nodes, then the refined level sequence is a sequence of n
elements, where the ith element is the depth of the ith node in preorder in
the pattern, followed by a 'd' if the node is distinguished, by an 'e' if
the node is existential, and by a 'p' if the node is a parameter. The
canonical ordering of a parameterized tree pattern P is then the ordering of
P that yields the lexicographically maximal refined level sequence among all
orderings of P. The refined level sequences for the parameterized tree
patterns in Figure 13 are then:

(a) 0p1e2p2d1d2p2d

(b) 0p1d2d2p1e2d2p

and (a) is the canonical one.

Figure 18: The tree pattern in (b) is a parent of a tree pattern in (a), and (c)
is the canonical version of (b).

Armed with the canonical form, we are now in a position to describe how the
algorithm of Section 4.4 must be modified to avoid equivalent parameterized
tree patterns. First of all, we only work with patterns (Π, Σ) in canonical
form; the others are dismissed. However, the problem then arises that a parent
pattern (Π', Σ'), where we omit a variable from either Π or Σ as described in
Section 4.4, might be non-canonical. In that case the frequency table for (Π', Σ')
will not exist. We can solve this by canonizing (Π', Σ') to its canonical version
(Π'', Σ''), and remembering the renaming of variables this entails. The table
FreqTab_{Π'',Σ''} can then serve in place of FreqTab_{Π',Σ'}, after we have applied the
inverse renaming to its column headings.

This does not completely solve the problem, however. Indeed, the frequency
table of (Π'', Σ'') might not yet have been computed. For example, consider
the parameterized tree pattern in Figure 18(a) and one of its parents in
Figure 18(b). The canonical version of this parent, using the canonical ordering
from the previous example, is shown in Figure 18(c). Using the current order
for computing the frequency tables as in Algorithm 1, the frequency table for
the pattern in Figure 18(c) is not yet computed when we want to compute the
frequency for the pattern in Figure 18(a).

We can solve this by changing the order in which we compute the frequency
tables. We work with increasing levels: in level i we compute the frequency
tables for all pairs (Π, Σ) where #Π + #Σ = i. If we use this order, we are sure
that when we compute the frequency table of a pair (Π, Σ), all frequency tables
of pairs (Π', Σ') with (#Π' + #Σ') < (#Π + #Σ) have been computed.
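The level ordering can be illustrated with a short Python sketch. It enumerates, for a hypothetical set of node names, all pairs (Π, Σ) of disjoint node sets at a given level, and the parents of a pair. It deliberately ignores frequency pruning, redundancy and canonicity checks, and any restriction on which nodes may become existential or parameters; the function names are assumptions.

```python
from itertools import combinations

def pairs_at_level(nodes, i):
    """Yield all pairs (Pi, Sigma) of disjoint node sets with #Pi + #Sigma = i."""
    for chosen in combinations(nodes, i):
        for k in range(i + 1):
            for pi in combinations(chosen, k):
                sigma = [n for n in chosen if n not in pi]
                yield frozenset(pi), frozenset(sigma)

def parents(pi, sigma):
    """Parents omit exactly one variable from either Pi or Sigma."""
    for n in pi:
        yield pi - {n}, sigma
    for n in sigma:
        yield pi, sigma - {n}

nodes = ["x1", "x2", "x3"]          # hypothetical node names of a tree T
level1 = set(pairs_at_level(nodes, 1))
level2 = list(pairs_at_level(nodes, 2))

# Every parent of a level-2 pair lives at level 1, so processing levels in
# increasing order guarantees that parent tables are computed first.
print(all(p in level1 for pair in level2 for p in parents(*pair)))
```

Each parent has exactly one variable fewer than its child, which is why a level-by-level order suffices even after parents have been renamed to their canonical versions.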



4.5.4 The Algorithm 

The final algorithm is now given in Algorithm 2. The outline for canonizing a
parameterized tree pattern is given in Function 3.



4.5.5 Example run 

In this Section we give an example run of the final algorithm in Algorithm 2.
We use the same data graph G, unordered rooted tree T, and minimum support
threshold, 3, as in the example in Section 4.4.4.



Algorithm 2 Levelwise search for frequent tree patterns
with equivalence checking.

 1: for each unordered, rooted tree T do
 2:   level := number of nodes of T
 3:   i := 0
 4:   C_0 := {(∅, ∅)}; F := ∅
 5:   while i < level AND C_i ≠ ∅ do
 6:     {Candidate evaluation}
 7:     for each pattern (Π, Σ) in C_i do
 8:       if Σ = ∅ then
 9:         Compute FreqTab_{Π,∅} in SQL
10:       else
11:         if (#Σ = 1 AND #Π = 0) then
12:           CanTab_{Π,Σ} := set of nodes of G
13:         else
14:           CanTab_{Π,Σ} := ⋈ {f(FreqTab_{Π'',Σ''}) | (Π', Σ') parent of (Π, Σ)
15:                              and (f, (Π'', Σ'')) = Canonize(Π', Σ')}
16:         end if
17:       end if
18:       Compute FreqTab_{Π,Σ} in SQL
19:       if (FreqTab_{Π,Σ} ≠ ∅) then
20:         F := F ∪ {(Π, Σ)}
21:       end if
22:     end for
23:     {Candidate generation}
24:     C_{i+1} := {(Π, Σ) | all parents (Π', Σ') of (Π, Σ) are in F}
25:     {Equivalence Check}
26:     for each pattern (Π, Σ) in C_{i+1} do
27:       if ((Π, Σ) contains a redundancy) then
28:         remove (Π, Σ) from C_{i+1}
29:       else if ((Π, Σ) is not canonical) then
30:         remove (Π, Σ) from C_{i+1}
31:       end if
32:     end for
33:     i := i + 1
34:   end while
35: end for



Function 3 Canonize(Π', Σ') based on T
1: (Π^c, Σ^c) := canonization of (Π', Σ') based on T
2: f := isomorphism from (Π^c, Σ^c) to (Π', Σ')
3: return (f, (Π^c, Σ^c))

Note that there are two important differences between this run and the run
in Section 4.4.4:

1. duplicate work is avoided: equivalent tree patterns are not generated; and

2. the order for computing the tree patterns is different in the sense that here,
the tree patterns are generated in levels, as explained in Section 4.5.3.

The example run then looks as follows:



[Table: example run of Algorithm 2 for levels 0 to 2. For each candidate
pattern (Π, Σ) it shows the candidate table CanTab and the frequency table
FreqTab; candidates pruned by the equivalence check are marked "Redundancy"
or "Equivalent with (Π', Σ')".]

4.6 Result Management: Pattern Database

When the algorithm terminates, its final output consists of a set of frequency
tables for each tree T that was investigated. All frequency tables are kept in a
relational database that we call the pattern database. The pattern database is
an ideal platform for an interactive tool for browsing the frequent queries. We
developed such a browser, called Certhia, and discuss it in Section 6.

The pattern database is also an ideal platform for tree-query association
mining, as will be described in Section 5.

5 Mining Tree-Query Associations 

In this Section we present an algorithm for mining confident tree-query
associations in a large data graph. Recall from Section 3.4 that a parameterized
association rule (pAR) is something of the form Q_1 ⇒_ρ Q_2, with Q_1 and Q_2
parameterized tree queries, ρ : Σ_1 → Σ_2 a parameter correspondence, and Q_2 ⊆_ρ Q_1.
An instantiated association rule (iAR) is a pair (Q_1 ⇒_ρ Q_2, α), with Q_1 ⇒_ρ Q_2
a pAR and α : Σ_2 → U a parameter assignment for Q_1 ⇒_ρ Q_2. Also recall that
the confidence of an iAR in a data graph G is defined as
Freq(Q_2^α) / Freq(Q_1^{α∘ρ}).

The algorithm presented in this Section finds all iARs of the form
(Q_left ⇒_ρ Q_right, α) that are confident and frequent in a given data graph G
for a given lhs Q_left. Before presenting the algorithm, we first show that we do
not need to tackle the problem in its full generality.

5.1 Problem Reduction 

In this section we show that, without loss of generality, we can focus on the
case where the given lhs tree query Q_left is pure in the sense that was defined in
Section 4.1. We will also show that this restriction cannot be imposed on the
rhs tree queries to be output. We also make a remark regarding "free constants"
in the head of a tree query.

Pure lhs's Assume that all possible variables (nodes of tree patterns) have
been arranged in some fixed but arbitrary order. Recall then from Section 4.1
that we call a parameterized tree query Q = (H, P) pure when H consists of
the enumeration in order and without repetitions of all distinguished variables
of P. In particular, H cannot contain parameters. We call H the pure head for
P. As an illustration, the lhs of rule (a) of Figure 19 is impure, while the lhs of
rule (b) is pure.

Figure 19: Rule (a) has a non-pure lhs. Rule (b) is the purification of rule
(a), and expresses precisely the same information. Rules (c) and (d) are two
example instantiations.

Consider the pARs in Figure 19(a) and Figure 19(b) and their instantiations
in Figure 19(c) and Figure 19(d). The rules in Figure 19(a) and Figure 19(c)
have an impure lhs. If we apply the iARs in Figure 19(c) and Figure 19(d) to
the data graph G in Figure 4(a), both have the same frequency, namely 2, and
the same confidence, namely 33%. Indeed, since the frequency of a tree query
is in fact the frequency of its body, repetitions of distinguished variables in the
head and the occurrence of parameters in the head do not change the frequency
of a tree query. In fact, the pAR in Figure 19(b) is the purification of the pAR
in Figure 19(a): the repetition of the distinguished variable x_2 is removed from
the heads, and the parameter σ_1 is removed from the heads.

Hence, a pAR with an impure lhs can always be rewritten to an equivalent
pAR with a pure lhs, in such a way that all instantiations of the pAR with
the impure lhs correspond to instantiations of the pAR with pure lhs, with the
same confidence and frequency. Indeed, take a legal pAR Q_1 ⇒_ρ Q_2 with Q_1 not
pure. We know that Q_1's head is mapped to Q_2's head by some ρ-containment
mapping. Hence, we can purify Q_1 by removing all parameters and repetitions
of distinguished variables from Q_1's head, sorting the head by the order on the
variables, and performing the corresponding actions on Q_2's head as prescribed by
the ρ-containment mapping.
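The purification step can be sketched in a few lines of Python. The sketch assumes that the two heads are given as equally long lists in which the ith entry of the rhs head is the image of the ith entry of the lhs head under the containment mapping; the function name, the variable names and the example heads are hypothetical.

```python
def purify(head1, head2, distinguished1, order):
    """Purify the lhs head and apply the same positional edits to the rhs head.

    head1, head2   -- equally long lists; head2[i] is the image of head1[i]
    distinguished1 -- set of distinguished variables of the lhs pattern
    order          -- dict giving the rank of each variable in the fixed order
    """
    seen = set()
    kept = []
    for pos, v in enumerate(head1):
        # Drop parameters and repeated distinguished variables.
        if v in distinguished1 and v not in seen:
            seen.add(v)
            kept.append(pos)
    # Sort the surviving positions by the fixed order on the lhs variables.
    kept.sort(key=lambda pos: order[head1[pos]])
    return [head1[p] for p in kept], [head2[p] for p in kept]

# Hypothetical impure lhs head (x2, s1, x1, x2) and the matching rhs head.
lhs, rhs = purify(["x2", "s1", "x1", "x2"],
                  ["y2", "t1", "y1", "y2"],
                  distinguished1={"x1", "x2"},
                  order={"x1": 1, "x2": 2})
print(lhs, rhs)
```

Note that only the lhs head is guaranteed to become pure; the rhs head merely loses the corresponding positions and may still contain parameters or repetitions.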

We can conclude that it is sufficient to only consider pARs with pure lhs's. 
The rhs, however, need not be pure; impure rhs's are in fact interesting, as we 
will demonstrate next. 



Figure 20: (a) and (b) are pARs with impure rhs. (c) is an ill-advised attempt
to purify (b) on the rhs.

Figure 21: A pAR with a pure rhs.

Impure rhs's Consider the pAR in Figure 20(a). The rhs is impure since x_2
appears twice in the head. The pAR expresses that a sufficient proportion of
the matchings of the lhs pattern are also matchings of the rhs pattern, which
is the same as the lhs pattern except that x_2 is equal to x_3. Since the pAR has
no parameters, we can identify it with its instantiation by the empty parameter
assignment. The confidence is then:

\[
\frac{m}{\sum_x \deg^2 x}
\]

where m is the number of edges, x ranges over the nodes in the graph, and deg x
is the outdegree of (number of edges leaving) x. Since m = \sum_x deg x, an easy
calculation shows that this confidence is at least 1/m, and typically much larger:

\[
\frac{m}{\sum_x \deg^2 x}
  = \frac{\sum_x \deg x}{\sum_x \deg^2 x}
  \ge \frac{\sum_x \deg x}{\left(\sum_x \deg x\right)^2}
  = \frac{1}{\sum_x \deg x}
  = \frac{1}{m}
\]

Hence, the sparser the graph (with the number of nodes remaining the same),
the higher the confidence, and thus the pAR is interesting in that it tells us
something about the sparsity of the graph. As an illustration, on the graph of
Figure 4(a) the confidence is 0.4, but on the graph of Figure 4(b) it is 0.6.
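The confidence m / Σ_x deg² x can be checked numerically on any edge list. The sketch below assumes a directed graph without duplicate edges, given as (source, target) pairs, and that the lhs counts ordered pairs of (not necessarily distinct) edges leaving a common node, as the formula suggests. The function name and the toy graph are hypothetical and unrelated to Figure 4.

```python
def sparsity_confidence(edges):
    """Return m / sum_x deg(x)^2, with deg the outdegree."""
    outdeg = {}
    for src, _dst in edges:
        outdeg[src] = outdeg.get(src, 0) + 1
    m = len(edges)
    return m / sum(d * d for d in outdeg.values())

# Hypothetical graph: node 1 has outdegree 2, node 2 has outdegree 1.
toy_edges = [(1, 2), (1, 3), (2, 3)]
conf = sparsity_confidence(toy_edges)
print(conf)                         # 3 / (2*2 + 1*1)
print(conf >= 1 / len(toy_edges))   # the lower bound 1/m derived above
```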
Also consider the pAR in Figure 20(b). Again the rhs is impure, since its
head contains a parameter. Create an iAR for this pAR with α = σ_2 ↦ 8. With
confidence c, this iAR then expresses that a fraction c of all edges point to
node 8, which again would be an interesting property of the graph.

The knowledge expressed by the above two example pARs cannot be
expressed using pARs with pure rhs's. To illustrate, the pAR of Figure 20(c), which
has a pure rhs, may at first seem equivalent to that of Figure 20(b). On second
thought, however, it says nothing about the proportion of edges pointing to σ_2,
but only about the proportion of nodes with an edge to σ_2.

Of course, we are not implying that pARs with pure rhs's are uninteresting.
But all they can express are statements about the proportion of matchings of
the lhs that can be specialized or extended to a matching of the rhs (another
example is in Figure 21, which says something about the proportion of edges that
can be extended); they cannot say anything about the proportion of matchings
of the lhs that satisfy certain equalities in the distinguished variables.



Free Constants Most treatments of conjunctive database queries [9, 40]
allow arbitrary constants in the head. In our treatment, a constant can only
appear in the head as the value of a parameter. Fortunately this is enough. We
do not need to consider "free" constants, i.e., constants not corresponding to a
parameter value. To see this, first consider the possibility of free constants in
the lhs. The same argument we already gave to assume that the lhs is pure can
be used to dismiss this possibility. Next consider a constant in the rhs of an iAR
(Q_1 ⇒_ρ Q_2, α), with Q_1 = (H_1, P_1) and Q_2 = (H_2, P_2) and Q_1 already pure.
Then there must be a ρ-containment mapping f : Q_1 → Q_2, with f(H_1) = H_2,
for the iAR to be legal. Hence, a constant a can only appear in H_2 by one of
the following two possibilities:

1. a = α(f(σ)) = α(ρ(σ)), with σ ∈ H_1; or

2. a = α(f(x)), with x a distinguished variable in H_1.

However, in both cases a is not actually free, being equal to a parameter
value.



5.2 Overall Approach 

Given the inputs G, Q_left = (H_left, P_left), minconf, and minsup, an outline
of our algorithm for the association rule mining problem is that of four nested
loops:

1. Generate, incrementally, all possible trees of increasing sizes. Avoid trees
that are isomorphic to previously generated ones. The height of the
generated trees must be at least the height of the tree underlying P_left. (When
enough trees have been generated, this loop can be terminated.)

2. For each newly generated tree T, generate all frequent instantiated tree
patterns P^α based on that tree.

These first two loops are nothing but our algorithm for mining frequent tree
queries as presented in Section 4.

3. For each parameterized tree pattern P, generate all containment mappings
f from P_left to P. Here, a plain "containment mapping" is a ρ-containment
mapping, as defined in Section 3.3, for some ρ. Note that ρ then equals
f|_{Σ_left}.

4. For each f, generate the parameterized tree query Q_right = (f(H_left), P),
and all parameter assignments α such that (Q_left ⇒_ρ Q_right, α) is
frequent and the confidence exceeds minconf. The generation of all these
α's happens in a parallel fashion.

This approach is complete, i.e., it will output everything that must be
output. To see this, consider a legal, frequent and confident iAR
(Q_left ⇒_{ρ_0} Q_right, α_0), with Q_right = (H_right, P_right). The tree T is the
underlying tree of P_right; P_right is a tree pattern P in loop 2; the containment
mapping f in loop 3 is the ρ_0-containment mapping that exists since the iAR is
legal; H_right is f(H_left); and α in loop 4 is α_0.

The reader may wonder whether loop 3 cannot be organized in a levelwise
fashion. This is not obvious, however, since any two queries of the form
((f_1(H_left), P), α) and ((f_2(H_left), P), α) have exactly the same frequency, namely
that of P^α. Loop 4, however, is levelwise because it is based on loop 2, which is
levelwise.

As already mentioned, these first two loops are nothing but our algorithm for
mining frequent tree queries as presented in Section 4. As already explained in
Section 4.6, in loop 2 we build up a structured database containing all frequency
tables for all trees in loop 1. We call this database the pattern database. In fact
these two loops should be regarded as a preprocessing step; once built up, this
pattern database can be used to generate association rules.

Hence, in practice an outline for our rule-mining algorithm is the following: 

1. Preprocessing step: Generate a pattern database D using the algorithm
discussed in Section 4. Halt this algorithm when enough patterns are
generated.

2. Consider, in a levelwise order, each parameterized tree pattern P that has
frequent instantiations in D, and such that the height of the underlying
tree of P is at least the height of the underlying tree of P_left.

3. For each parameterized tree pattern P, generate all containment mappings
f from P_left to P and let ρ be f|_{Σ_left}.

4. For each f, generate the parameterized tree query Q = (f(H_left), P), and
all parameter assignments α such that (Q_left ⇒_ρ Q, α) is frequent and
the confidence exceeds minconf. The generation of all these α's happens
in a parallel fashion.

We present loops 3 and 4 in detail in Sections 5.3 and 5.4. In Section 5.6
we will show how our overall approach must be refined so that the generation
of equivalent association rules is avoided.

5.3 Generation of Containment Mappings 

In this section, we discuss loop 3, the generation of all containment mappings
f from P_left to P. So, we need to solve the following problem: Given two
parameterized tree patterns P_1 and P_2, find all containment mappings f from
P_1 to P_2.

Since the patterns are typically small, a naive algorithm suffices. For a
node x_1 of P_1 and a node x_2 of P_2, we say that x_1 "matches" x_2 if there is a
containment mapping f from the subpattern of P_1 rooted at x_1 to the subpattern
of P_2 rooted at x_2 such that f(x_1) = x_2. In a first phase, we determine for every
node y of P_2 separately whether the root r_1 of P_1 matches y. While doing so,
we also determine for every other node x_1 of P_1, and every node x_2 below y
at the same distance as x_1 is from r_1, whether x_1 matches x_2. We store all
these boolean values in a two-dimensional matrix Map. The function for filling
in Map is given in Function 4. In line 2 of this function we mean by "x_1 ↦ x_2 is
legal" that if x_1 is a distinguished variable, then x_2 is a distinguished variable
or a parameter; and if x_1 is a parameter, then x_2 is a parameter, as prescribed
by the definition of a ρ-containment mapping in Section 3.3.

Function 4 Function for filling in Map
 1: bool FillIn(x_1 ∈ P_1, x_2 ∈ P_2)
 2: if x_1 ↦ x_2 is legal then
 3:   Match := true;
 4:   for each child c_1 of x_1 from left to right do
 5:     MatchChild := false;
 6:     for each child c_2 of x_2 from left to right do
 7:       MatchChild := MatchChild OR FillIn(c_1, c_2)
 8:     end for
 9:     Match := Match AND MatchChild;
10:   end for
11:   Map[x_1, x_2] := Match;
12:   return Match;
13: else
14:   Map[x_1, x_2] := false;
15:   return false
16: end if



This first phase compares every possible pair (x_1, x_2), with x_1 a node in P_1
and x_2 a node in P_2, at most once. Indeed, if x_1 is at distance d from r_1, then
x_1 will be compared to x_2 only during the matching of r_1 with the node y that
is d steps above x_2 in P_2 (if it exists). We thus have an O(n_1 × n_2) algorithm,
where n_1 (n_2) is the number of nodes in P_1 (P_2).
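The first phase can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the dictionary standing in for the matrix Map, and the toy patterns are hypothetical, and node names are assumed to be unique. Unlike a short-circuiting reading of line 7 of Function 4, the sketch evaluates every pair of children so that every reachable entry is filled in.

```python
class Node:
    """A pattern node; kind is 'distinguished', 'parameter' or 'existential'."""
    def __init__(self, name, kind, children=()):
        self.name = name
        self.kind = kind
        self.children = list(children)

def legal(x1, x2):
    """Legality of mapping x1 to x2, as described for line 2 of Function 4."""
    if x1.kind == "distinguished":
        return x2.kind in ("distinguished", "parameter")
    if x1.kind == "parameter":
        return x2.kind == "parameter"
    return True  # an existential node may be mapped to any node

def fill_in(x1, x2, match):
    """Record in match[(x1.name, x2.name)] whether x1 matches x2."""
    if not legal(x1, x2):
        match[(x1.name, x2.name)] = False
        return False
    ok = True
    for c1 in x1.children:
        # Evaluate all pairs (no short-circuit) so that all entries are filled.
        results = [fill_in(c1, c2, match) for c2 in x2.children]
        ok = ok and any(results)
    match[(x1.name, x2.name)] = ok
    return ok

def preorder(root):
    yield root
    for child in root.children:
        yield from preorder(child)

def build_map(root1, root2):
    """Try to match the root of P1 with every node y of P2."""
    match = {}
    for y in preorder(root2):
        fill_in(root1, y, match)
    return match

# Hypothetical toy patterns.
p1 = Node("x1", "distinguished", [Node("e", "existential")])
p2 = Node("y1", "distinguished",
          [Node("y2", "existential"), Node("s", "parameter")])
match = build_map(p1, p2)
print(match[("x1", "y1")], match[("x1", "s")])
```

Pairs that are never compared (because an ancestor pair was already illegal) are simply absent from the dictionary and can be read as false with `match.get(pair, False)`.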

In a second phase, we output all containment mappings. Initially, by a
synchronous preorder traversal of P_1 and P_2, we map each node of P_1 to the
first matching node of P_2. We store this first mapping in a one-dimensional
matrix Cm. In Function 5 an outline for finding the initial containment mapping
is given.

In each subsequent step, we look for the last node x_1 (in preorder) of P_1,
currently matched to some node x_2, with the property that x_1 can also be
matched to a right sibling x_3 of x_2, and now map x_1 to the first such x_3. The
mappings of all nodes of P_1 coming after x_1 are reinitialized. Every such step
takes time that is linear in n_1 and n_2. Of course, the total number of different
containment mappings may well be exponential in n_1. An outline of this step
is given in Function 6.

Function 5 Function for finding the initial containment mapping
 1: Init(x_1 ∈ P_1, x_2 ∈ P_2)
 2: Cm[x_1] := x_2;
 3: for each child c_1 of x_1 from left to right do
 4:   for each child c_2 of x_2 from left to right do
 5:     if Map[c_1, c_2] then
 6:       Init(c_1, c_2);
 7:       Break;
 8:     end if
 9:   end for
10: end for

Function 6 Function for finding the other containment mappings
 1: bool Step(x ∈ P_1)
 2: Found := false;
 3: for each child c of x from right to left do
 4:   if Step(c) then
 5:     Found := true;
 6:     Break;
 7:   end if
 8: end for
 9: if Found then
10:   for each right-sibling z of c from left to right do
11:     p_2 := Cm[x];
12:     for each child c_2 of p_2 from left to right do
13:       if Map[z, c_2] then
14:         Init(z, c_2)
15:       end if
16:     end for
17:   end for
18:   return true;
19: else
20:   if x is the root of P_1 then
21:     return false;
22:   else
23:     m := Cm[x];
24:     for each right-sibling s of m from left to right do
25:       if Map[x, s] then
26:         Init(x, s)
27:         Break;
28:       end if
29:     end for
30:     return true;
31:   end if
32: end if

The complete outline for the generation of all containment mappings is given
in Function 7.

Function 7 Function for generating all containment mappings from P_1 to P_2
 1: GenerateCm(P_1, P_2)
 2: Initialize Map;
 3: r_1 := root of P_1;
 4: for each x_2 ∈ P_2 in preorder do
 5:   FillIn(r_1, x_2);
 6: end for
 7: for each node x_2 ∈ P_2 in preorder do
 8:   if Map[r_1, x_2] then
 9:     Initialize Cm;
10:     Init(r_1, x_2)
11:     repeat
12:       Output Cm;
13:     until not Step(r_1)
14:   end if
15: end for

We can thus easily generate all containment mappings f from P_left to P as
required for loop 3 of our overall algorithm. Note, however, that in loop 4 these
mappings are used to produce the head f(H_left) of query Q_right. For Q_right to
be a legal query, this head must contain all distinguished variables of P. Hence,
we only pass to loop 4 those f whose image contains all distinguished variables
of P.
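For small patterns, the set of containment mappings and the final filter can also be sketched by plain recursion. The Python sketch below does not follow the Init/Step scheme of Functions 5 to 7; it simply combines, for every node, all choices for its children. The tuple encoding of patterns and all names are assumptions. The filter reads "image" as the image of the pure lhs head, i.e. of the distinguished variables of the first pattern; that reading is ours.

```python
from itertools import product

# Assumed encoding: a node is (name, kind, children), with kind one of
# 'distinguished', 'parameter', 'existential'; names are unique.

def legal(kind1, kind2):
    if kind1 == "distinguished":
        return kind2 in ("distinguished", "parameter")
    if kind1 == "parameter":
        return kind2 == "parameter"
    return True

def mappings_from(x1, x2):
    """All containment mappings of the subtree at x1 that send x1 to x2."""
    name1, kind1, kids1 = x1
    name2, kind2, kids2 = x2
    if not legal(kind1, kind2):
        return []
    per_child = []
    for c1 in kids1:
        options = [m for c2 in kids2 for m in mappings_from(c1, c2)]
        if not options:
            return []          # some child of x1 cannot be mapped
        per_child.append(options)
    result = []
    for combo in product(*per_child):
        f = {name1: name2}
        for m in combo:
            f.update(m)
        result.append(f)
    return result

def nodes(x):
    yield x
    for c in x[2]:
        yield from nodes(c)

def containment_mappings(p1, p2):
    """The root of p1 may be mapped to any node of p2."""
    return [f for y in nodes(p2) for f in mappings_from(p1, y)]

def head_covers_distinguished(f, p1, p2):
    """Keep f only if the image of p1's distinguished variables contains
    every distinguished variable of p2."""
    head_image = {f[n] for n, k, _ in nodes(p1) if k == "distinguished"}
    needed = {n for n, k, _ in nodes(p2) if k == "distinguished"}
    return needed <= head_image

# Hypothetical toy patterns.
p1 = ("x1", "distinguished", [("x2", "distinguished", [])])
p2 = ("y1", "distinguished", [("y2", "distinguished", []),
                              ("s", "parameter", [])])
all_f = containment_mappings(p1, p2)
passed = [f for f in all_f if head_covers_distinguished(f, p1, p2)]
print(len(all_f), passed)
```

The number of mappings can grow exponentially with the size of the first pattern, as noted above, so this direct enumeration is only reasonable because patterns are small.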

5.4 Generation of Parameter Assignments 

In loop 4, our task is the following. Given a containment mapping / : Pi f t — > P, 
let p = /|s loft , and generate all parameter assignments a such that (Qi e ft =>p 
(/(.ffieft), P), a) is frequent and confident in G. We show how this can be done 
in a parallel database-oriented fashion. 

Recall from Section 4.6 that the frequency tables for $P_{left}$ and $P$ are
available in a relational database. Our crucial observation is that we can
compute precisely the required set of parameter assignments $\alpha$, together
with the frequency and confidence of the corresponding association rules, by a
single relational algebra expression. This expression has the following form:

$$
\pi_{plist}\; \sigma_{FreqTab_{P}.freq \,/\, FreqTab_{P_{left}}.freq \;\geq\; minconf}
\bigl( FreqTab_{P_{left}} \bowtie_{\theta} FreqTab_{P} \bigr)
$$

Here, $\pi$ denotes projection, $\sigma$ denotes selection, and $\bowtie$ denotes
join. The join condition $\theta$ and the projection list $plist$ are defined as
follows. For $\theta$, we take the conjunction:

$$
\bigwedge_{\sigma \in \Sigma_{left}} FreqTab_{P_{left}}.\sigma = FreqTab_{P}.\rho(\sigma)
$$

Furthermore, $plist$ consists of all attributes $P_{left}.\sigma_{left}$, with
$\sigma_{left} \in \Sigma_{left}$; all attributes $P.\sigma$, with
$\sigma \in \Sigma$; together with the attributes $FreqTab_{P}.freq$ and
$FreqTab_{P}.freq / FreqTab_{P_{left}}.freq$.
Referring back to our overall algorithm (Section 5.2), we thus generate, for
each pattern $P$ from loop 2 and each containment mapping $f$ in loop 3, all
association rules with the given $Q_{left}$ as lhs in parallel, by one relational
database query (which can be implemented by a simple SQL select-statement).

Example. Consider $Q_{left}$ and $P = (\Pi, \Sigma)$ as shown in Figures 22(a)
and 22(b). We have $\Sigma_{left} = \{\sigma_1, \sigma_4\}$ and
$\Pi_{left} = \{\exists_3, \exists_6\}$, and
$\Sigma = \{\sigma_1, \sigma_4, \sigma_5\}$ and $\Pi = \{\exists_3\}$. Take the
following containment mapping $f$ from $P_{left}$ to $P$:



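$$
\begin{array}{c|c}
 & f \\ \hline
\sigma_1 & \sigma_1 \\
x_2 & x_2 \\
\exists_3 & \exists_3 \\
\sigma_4 & \sigma_4 \\
x_5 & x_2 \\
\exists_6 & \exists_3 \\
x_7 & \sigma_4
\end{array}
$$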



Then the rhs query $Q_{right}$ equals $((x_2, x_2, \sigma_4), P)$, and the
relational algebra expression for computing all parameter assignments and their
corresponding frequencies and confidences looks as follows:



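$$
\pi_{plist}\; \sigma_{FreqTab_{P}.freq \,/\, FreqTab_{P_{left}}.freq \;\geq\; minconf}
\bigl( FreqTab_{P_{left}} \bowtie_{\theta} FreqTab_{P} \bigr)
$$

with $plist$ equal to

$FreqTab_{P_{left}}.\sigma_1$, $FreqTab_{P_{left}}.\sigma_4$, $FreqTab_{P}.\sigma_1$,
$FreqTab_{P}.\sigma_4$, $FreqTab_{P}.\sigma_5$, $FreqTab_{P}.freq$,
$FreqTab_{P}.freq / FreqTab_{P_{left}}.freq$

and $\theta$ equal to

$FreqTab_{P_{left}}.\sigma_1 = FreqTab_{P}.\sigma_1 \wedge FreqTab_{P_{left}}.\sigma_4 = FreqTab_{P}.\sigma_4$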
In SQL, we get: 

SELECT freqQleft.x1, freqQleft.x4, freqP.x1, freqP.x4,
       freqP.x5, freqP.freq, freqP.freq/freqQleft.freq
FROM freqP, freqQleft
WHERE freqQleft.x1 = freqP.x1 AND freqQleft.x4 = freqP.x4
  AND freqP.freq/freqQleft.freq >= minconf
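To make the query concrete, the following sketch runs it against two toy frequency tables using Python's built-in sqlite3 module. Everything about the setup is an assumption for illustration: the table contents are invented, SQLite stands in for the database system, and the factor `1.0 *` is added only because SQLite would otherwise perform integer division on integer columns.

```python
import sqlite3


def confident_rules(rows_left, rows_p, minconf):
    """Join the two frequency tables and keep the confident instantiations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE freqQleft (x1 INTEGER, x4 INTEGER, freq INTEGER)")
    conn.execute(
        "CREATE TABLE freqP (x1 INTEGER, x4 INTEGER, x5 INTEGER, freq INTEGER)"
    )
    conn.executemany("INSERT INTO freqQleft VALUES (?, ?, ?)", rows_left)
    conn.executemany("INSERT INTO freqP VALUES (?, ?, ?, ?)", rows_p)
    query = """
        SELECT freqQleft.x1, freqQleft.x4, freqP.x1, freqP.x4, freqP.x5,
               freqP.freq, 1.0 * freqP.freq / freqQleft.freq
        FROM freqP, freqQleft
        WHERE freqQleft.x1 = freqP.x1 AND freqQleft.x4 = freqP.x4
          AND 1.0 * freqP.freq / freqQleft.freq >= ?
        ORDER BY freqP.x1, freqP.x4, freqP.x5
    """
    result = conn.execute(query, (minconf,)).fetchall()
    conn.close()
    return result


# Invented toy data: (x1, x4, freq) for the lhs, (x1, x4, x5, freq) for P.
LEFT_ROWS = [(1, 2, 10), (3, 4, 8)]
P_ROWS = [(1, 2, 7, 5), (1, 2, 9, 2), (3, 4, 7, 4)]
RULES = confident_rules(LEFT_ROWS, P_ROWS, 0.5)
```

Each returned row is one parameter assignment together with its frequency and confidence, which is the shape of output that loop 4 needs.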



5.5 Example Run 

In this section we give an example run of the algorithm discussed in Section 5.
We use the same data graph $G$, unordered rooted tree $T$, and minimum support
threshold, 3, as in the example run in Section 4.4.4. The fixed lhs tree query
is given in Figure 23(a), its corresponding frequency table in Figure 23(b), and
the minimum confidence threshold is 30%. All frequent tree patterns based on
$T$ were already generated in the example run of Section 4.5.5.
The example run then looks as follows:
Figure 22: Example $Q_{left}$ and $P$.

Figure 23: The fixed lhs and its frequency table for the example run in
Section 5.5.
5.6 Equivalent Association Rules



In this section, we make a number of modifications to the algorithm described 
so far, so as to avoid duplicate work on equivalent rules. 

Let us first look at an example of the duplicate work that the algorithm
presented until now performs. Consider $Q_{left}$, $Q_1 = (f_1(H_{left}), P)$,
$Q_2 = (f_2(H_{left}), P)$ and $Q_3 = (f_3(H_{left}), P)$ in Figure 24, with
$f_1$, $f_2$ and $f_3$ as follows:

$$
\begin{array}{c|cccc}
 & x_1 & x_2 & x_3 & x_4 \\ \hline
f_1 & u_1 & u_2 & u_2 & u_3 \\
f_2 & u_1 & u_3 & u_2 & u_2 \\
f_3 & u_1 & u_2 & u_3 & u_3
\end{array}
$$

Furthermore, consider pAR1: $Q_{left} \Rightarrow Q_1$; pAR2:
$Q_{left} \Rightarrow Q_2$; and pAR3: $Q_{left} \Rightarrow Q_3$.

The confidence of the first rule (pAR1) equals the proportion of tuples from
the answer set of $Q_{left}$ where the values for variables $x_2$ and $x_3$ are
equal (in the rhs those equal variables are represented by variable $u_2$, and
the lhs variable $x_4$ is represented by the rhs variable $u_3$). Similarly, the
confidence of the second rule (pAR2) equals the proportion of tuples from the
answer set of $Q_{left}$ where the values for the variables $x_3$ and $x_4$ are
equal (again the equal lhs variables $x_3$ and $x_4$ are represented by the rhs
variable $u_2$, and the lhs variable $x_2$ is represented by the rhs variable
$u_3$). Since, due to the symmetry in the lhs pattern, the columns for $x_2$,
$x_3$ and $x_4$ are fully interchangeable in the answer set of $Q_{left}$, both
rules convey precisely the same information: their confidences are equal. The
third rule (pAR3) is yet another representation of the same association, but
now the equal lhs variables $x_3$ and $x_4$ are represented by the rhs variable
$u_3$. Again, it has the same confidence as pAR1 and pAR2.
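The symmetry argument can be checked on a small example. The sketch below assumes a toy data graph (invented for illustration) and computes the answer set of a lhs query whose body is a root $x_1$ with three children $x_2$, $x_3$, $x_4$ and whose head lists all four nodes; it then compares the proportion of tuples with $x_2 = x_3$ against the proportion with $x_3 = x_4$. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def answer_set_star(edges):
    """All tuples (x1, x2, x3, x4) such that x1 has edges to x2, x3 and x4."""
    children = {}
    for parent, child in set(edges):
        children.setdefault(parent, []).append(child)
    return {
        (parent, a, b, c)
        for parent, kids in children.items()
        for a, b, c in product(kids, repeat=3)
    }


def proportion_equal(tuples, i, j):
    """Fraction of tuples whose columns i and j carry the same value."""
    return Fraction(sum(1 for t in tuples if t[i] == t[j]), len(tuples))


# Invented toy data graph: node 1 has three children, node 2 has two.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
ANSWERS = answer_set_star(EDGES)
CONF_X2_X3 = proportion_equal(ANSWERS, 1, 2)  # columns of x2 and x3
CONF_X3_X4 = proportion_equal(ANSWERS, 2, 3)  # columns of x3 and x4
```

Because the three child columns are interchangeable, both proportions coincide on any data graph, not just on this one.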

It is important to note that the above pARs only differ in the containment
mappings $f_1$, $f_2$ and $f_3$ that generate the rhs head. The algorithm
discussed until now generates all these pARs, since we do not perform any check
on the containment mappings generated in loop 3 of the overall approach
(Section 5.2).

In this subsection, motivated by the above example, we consider the general
problem of when two pARs $Q_{left} \Rightarrow_{\rho_1} Q_1$ and
$Q_{left} \Rightarrow_{\rho_2} Q_2$ are equivalent, where $Q_1$ and $Q_2$ are of
the form $(f_1(H_{left}), P)$ and $(f_2(H_{left}), P)$ for some common rhs
pattern $P$, and containment mappings $f_1$ and $f_2$ from $P_{left}$ to $P$.
(Thus $\rho_1$ is $f_1|_{\Sigma_{left}}$ and $\rho_2$ is $f_2|_{\Sigma_{left}}$.)
Since two such pARs differ only in $f_1$ and $f_2$, we can actually focus on
$f_1$ and $f_2$.

It is important to remember for the rest of this subsection that $P_{left}$ and
$P$ are arbitrary but fixed. Furthermore, without loss of generality we assume
that the nodes of $P_{left}$ and $P$ are disjoint. This assumption greatly
simplifies the representation of containment mappings by graphs, as we will see
shortly.



Equivalent Containment Mappings Recall from Section 4.5.3 that an isomorphism
from a parameterized tree pattern $P_1$ to a parameterized tree pattern $P_2$
is a homomorphism from $P_1$ to $P_2$ that is a bijection and that maps
distinguished nodes to distinguished nodes, parameters to parameters and
existential nodes to existential nodes. We now formalize equivalent containment
mappings as follows: two containment mappings $f_1$ and $f_2$ are equivalent if
the structures $(P_{left}, P, f_1)$ and $(P_{left}, P, f_2)$ are isomorphic.
Specifically, there must exist isomorphisms (actually automorphisms)
$g : P_{left} \to P_{left}$ and $h : P \to P$ such that
$f_2 \circ g = h \circ f_1$.

Consider for instance $f_1$ and $f_3$ from the example above; then $h$ swaps
$u_2$ and $u_3$, and $g$ is the cyclic permutation
$x_2 \mapsto x_3 \mapsto x_4 \mapsto x_2$.

  (x1, x2, x3, x4)        (u1, u2, u2, u3)
         x1                      u1
       / |  \                   /  \
     x2  x3  x4               u2    u3
     (a) Q_left                (b) Q_1

  (u1, u3, u2, u2)        (u1, u2, u3, u3)
         u1                      u1
        /  \                    /  \
      u2    u3                u2    u3
       (c) Q_2                 (d) Q_3

Figure 24: Queries to illustrate the duplicate work in the association mining
algorithm.

Figure 25: The graph representations of $f_1$ (a) and $f_3$ (b).
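For patterns of this size, the definition can be checked directly by brute force. The following sketch is an illustration under assumptions that are not from the paper: trees are dictionaries from a node to its list of children, every node carries a kind label in a separate dictionary, and all automorphisms are enumerated by trying every permutation, which is only feasible for very small patterns. The names are hypothetical.

```python
from itertools import permutations


def automorphisms(children, kind):
    """Yield all kind-preserving bijections that map the edge set onto itself."""
    nodes = sorted(kind)
    edges = {(p, c) for p, cs in children.items() for c in cs}
    for image in permutations(nodes):
        g = dict(zip(nodes, image))
        same_kinds = all(kind[g[v]] == kind[v] for v in nodes)
        if same_kinds and {(g[p], g[c]) for p, c in edges} == edges:
            yield g


def equivalent(left, kind_left, right, kind_right, f1, f2):
    """True if some automorphisms g, h satisfy f2(g(v)) == h(f1(v)) for all v."""
    right_autos = list(automorphisms(right, kind_right))
    for g in automorphisms(left, kind_left):
        for h in right_autos:
            if all(f2[g[v]] == h[f1[v]] for v in kind_left):
                return True
    return False


# The patterns and mappings of the running example (all nodes distinguished).
LEFT = {"x1": ["x2", "x3", "x4"]}
RIGHT = {"u1": ["u2", "u3"]}
KIND_LEFT = {v: "distinguished" for v in ["x1", "x2", "x3", "x4"]}
KIND_RIGHT = {v: "distinguished" for v in ["u1", "u2", "u3"]}
F1 = {"x1": "u1", "x2": "u2", "x3": "u2", "x4": "u3"}
F2 = {"x1": "u1", "x2": "u3", "x3": "u2", "x4": "u2"}
F3 = {"x1": "u1", "x2": "u2", "x3": "u3", "x4": "u3"}
F_ALL_U2 = {"x1": "u1", "x2": "u2", "x3": "u2", "x4": "u2"}
```

The mapping `F_ALL_U2` is added as a contrast case: its image misses $u_3$, so it cannot be equivalent to any of the three mappings of the example.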

5.6.1 Testing for equivalence 

To test for equivalent containment mappings efficiently, we represent them using 
graphs. 

Graph representation of a containment mapping The graph representation of a
containment mapping $f : P_{left} \to P$ is a directed, edge- and vertex-colored
graph, with set of vertices
$V_f = \mathrm{Vertices}(P_{left}) \cup \mathrm{Vertices}(P)$ and set of edges
$E_f = \mathrm{Edges}(P_{left}) \cup \mathrm{Edges}(P) \cup \{(v, w) \mid f(v) = w\}$
(with the understanding that the edges of $P_{left}$ and $P$ go from parent to
child). We use different colors for the edges of $P_{left}$, the edges of $P$
and the pairs in $f$, and we also use different colors for the distinguished
nodes, the existential nodes and the parameters.

As an illustration, Figure 25 shows the graph representation of $f_1$ and $f_3$
from our example in the introduction above.

Graph Isomorphism Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are
colored isomorphic if there exists a bijection $\varphi : V_1 \to V_2$, extended
to edges $(v, w) \in E_1$ in a natural way by
$\varphi(v, w) = (\varphi(v), \varphi(w))$, such that the colors of vertices and
edges are preserved by $\varphi$, and such that
$(v, w) \in E_1 \Leftrightarrow (\varphi(v), \varphi(w)) \in E_2$.

The following lemma then shows the utility of the colored graph representation
of containment mappings.

Lemma 6. Two containment mappings are equivalent if and only if their colored
graph representations are isomorphic.

Proof. Let us start with the only-if direction. Consider two equivalent
containment mappings $f_1$, $f_2$ from $P_{left}$ to $P$. By definition of
equivalent containment mappings, there exist isomorphisms
$g : P_{left} \to P_{left}$ and $h : P \to P$ such that
$f_2 \circ g = h \circ f_1$. Now take $\varphi = g \cup h$. Then $\varphi$ is
clearly a bijection from $V_{f_1}$ to $V_{f_2}$, and clearly preserves the colors
of vertices and edges of $G_{f_1}$. Let $(v, w) \in E_{f_1}$. We show that
$\varphi$ is indeed an isomorphism from $G_{f_1}$ to $G_{f_2}$. There are three
possibilities:

1. $(v, w) \in \mathrm{Edges}(P_{left})$. Note that then also
$g(v, w) \in \mathrm{Edges}(P_{left})$. We have:

$$
\begin{aligned}
(v, w) \in E_{f_1} &\Leftrightarrow (v, w) \in \mathrm{Edges}(P_{left}) \\
&\Leftrightarrow g(v, w) \in \mathrm{Edges}(P_{left}) && \text{($g$ is an automorphism of $P_{left}$)} \\
&\Leftrightarrow g(v, w) \in E_{f_2} \\
&\Leftrightarrow \varphi(v, w) \in E_{f_2} && \text{($\varphi(v, w) = g(v, w)$)}
\end{aligned}
$$

2. $(v, w) \in \mathrm{Edges}(P)$. Note that then also
$h(v, w) \in \mathrm{Edges}(P)$. We have:

$$
\begin{aligned}
(v, w) \in E_{f_1} &\Leftrightarrow (v, w) \in \mathrm{Edges}(P) \\
&\Leftrightarrow h(v, w) \in \mathrm{Edges}(P) && \text{($h$ is an automorphism of $P$)} \\
&\Leftrightarrow h(v, w) \in E_{f_2} \\
&\Leftrightarrow \varphi(v, w) \in E_{f_2} && \text{($\varphi(v, w) = h(v, w)$)}
\end{aligned}
$$

3. $w = f_1(v)$. We have:

$$
\begin{aligned}
(v, w) \in E_{f_1} &\Leftrightarrow w = f_1(v) \\
&\Leftrightarrow h(w) = h(f_1(v)) \\
&\Leftrightarrow h(w) = f_2(g(v)) \\
&\Leftrightarrow \varphi(w) = f_2(\varphi(v)) \\
&\Leftrightarrow (\varphi(v), \varphi(w)) \in E_{f_2}
\end{aligned}
$$

So we can conclude that $G_{f_1}$ and $G_{f_2}$ are indeed colored isomorphic.

Let us now look at the if-direction. Let $\varphi$ be the given isomorphism from
$G_{f_1}$ to $G_{f_2}$. Now take $g = \varphi|_{\mathrm{Vertices}(P_{left})}$ and
$h = \varphi|_{\mathrm{Vertices}(P)}$. To prove that $f_1$ and $f_2$ are
equivalent it suffices to show that:

1. $g$ is an isomorphism from $P_{left}$ to $P_{left}$;

2. $h$ is an isomorphism from $P$ to $P$; and

3. $f_2 \circ g = h \circ f_1$.

Items 1 and 2 hold because $\varphi$ preserves the colors. For 3, let
$v \in P_{left}$. Since $\varphi$ is a graph isomorphism, $f_2(\varphi(v))$
equals $\varphi(f_1(v))$. We then have:

$$
f_2(g(v)) = f_2(\varphi(v)) = \varphi(f_1(v)) = h(f_1(v))
$$

□

So, using graph isomorphism (to be precise, edge- and vertex-colored directed
graph isomorphism), we can test for equivalence. Since our patterns are not
very large, fast heuristics for graph isomorphism can be used. We use the
program Nauty [351 133], which is considered the fastest heuristic for graph
isomorphism. Nauty is very efficient for small, dense random graphs [15]. Since
our graph representations are typically small (no more than 20 vertices) and
dense, this works well in our case.
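As a rough illustration of the test itself (not of Nauty, which uses far more refined techniques), the colored graph representation can be built and compared by exhaustive search when the patterns are tiny. The sketch assumes that each pattern is given as a list of parent-child pairs and that node kinds are supplied in a dictionary; the function names and the color labels "left", "right" and "map" are hypothetical.

```python
from itertools import permutations


def graph_representation(left_edges, right_edges, kinds, f):
    """Vertex colors are node kinds; edge colors separate P_left, P and f."""
    edges = {(p, c, "left") for p, c in left_edges}
    edges |= {(p, c, "right") for p, c in right_edges}
    edges |= {(v, w, "map") for v, w in f.items()}
    return dict(kinds), edges


def colored_isomorphic(rep1, rep2):
    """Exhaustively search for a color-preserving bijection between two graphs."""
    (kinds1, edges1), (kinds2, edges2) = rep1, rep2
    nodes1, nodes2 = sorted(kinds1), sorted(kinds2)
    if len(nodes1) != len(nodes2) or len(edges1) != len(edges2):
        return False
    for image in permutations(nodes2):
        phi = dict(zip(nodes1, image))
        if not all(kinds1[v] == kinds2[phi[v]] for v in nodes1):
            continue
        if {(phi[a], phi[b], color) for a, b, color in edges1} == edges2:
            return True
    return False


# The running example: x1 with three children, u1 with two children.
LEFT_EDGES = [("x1", "x2"), ("x1", "x3"), ("x1", "x4")]
RIGHT_EDGES = [("u1", "u2"), ("u1", "u3")]
KINDS = {v: "distinguished" for v in ["x1", "x2", "x3", "x4", "u1", "u2", "u3"]}
MAP_1 = {"x1": "u1", "x2": "u2", "x3": "u2", "x4": "u3"}
MAP_3 = {"x1": "u1", "x2": "u2", "x3": "u3", "x4": "u3"}
MAP_FLAT = {"x1": "u1", "x2": "u2", "x3": "u2", "x4": "u2"}
REP_1 = graph_representation(LEFT_EDGES, RIGHT_EDGES, KINDS, MAP_1)
REP_3 = graph_representation(LEFT_EDGES, RIGHT_EDGES, KINDS, MAP_3)
REP_FLAT = graph_representation(LEFT_EDGES, RIGHT_EDGES, KINDS, MAP_FLAT)
```

With seven vertices the search inspects at most 5040 bijections; for anything larger a dedicated tool is the sensible choice.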

Theoretically this situation is not entirely satisfying, as graph isomorphism
is not known to be efficiently (polynomial-time) solvable in general. We can
show, however, that equivalence of our containment mappings is really as hard as
the general graph isomorphism problem. This hardness argument is presented
in the following Section 5.6.2. A special case of the equivalence problem that
is solvable in polynomial time is presented in Section 5.6.3.

5.6.2 Hardness argument 

First recall from graph theory that a graph $B = (V, E)$ is bipartite if $V$ can
be split in two disjoint parts, $V = V^a \cup V^b$ with
$V^a \cap V^b = \emptyset$, such that for each $(v, w) \in E$ we have
$v \in V^a$ and $w \in V^b$. The vertices in $V^a$ are called lhs vertices and
those in $V^b$ rhs vertices (left-hand side, right-hand side).

We first reduce the problem of bipartite graph isomorphism to equivalence
of our containment mappings. Let $B_1 = (V_1, E_1)$ and $B_2 = (V_2, E_2)$ be
bipartite graphs. We describe an efficient construction that produces from
$B_1$ and $B_2$ two association rules $(P_{left}, P, f_1)$ and
$(P_{left}, P, f_2)$ such that $B_1$ and $B_2$ are isomorphic if and only if the
association rules are equivalent. This construction reduces the bipartite graph
isomorphism problem to equivalence of containment mappings.

Without loss of generality, we assume that $B_1$ and $B_2$ have precisely the
same multiset of outdegrees (for vertices of $V_1^a$ and $V_2^a$), and precisely
the same number of vertices in $V_1^b$ and $V_2^b$. Indeed, if these conditions
are not satisfied, then $B_1$ and $B_2$ are never isomorphic and our reduction
can output some arbitrary $P_{left}$, $P$, $f_1$ and $f_2$ as long as
$(P_{left}, P, f_1)$ and $(P_{left}, P, f_2)$ are not equivalent.

The construction is now as follows. By the premises on $B_1$ and $B_2$, we
may assume, without loss of generality, that $V_1^a = V_2^a$ and
$V_1^b = V_2^b$. This can be accomplished by sorting the lhs vertices in each
graph on their outdegrees and then numbering them in that order, breaking ties
arbitrarily (the rhs vertices can simply be numbered arbitrarily).

1. Construction of $P_{left}$. This is a tree with root called $r_{left}$ and, as
children of the root, all lhs vertices. Moreover, each lhs vertex $v$ has its
own children as follows: if $v$ has outdegree $o$, then $v$ has $o$ children
denoted by $[v, 1], [v, 2], \ldots, [v, o]$.

2. Construction of $P$. This is a tree with root called $r_{right}$, and exactly
one child of the root, called $c$. Moreover, $c$ has as children precisely all
rhs vertices.

3. Construction of $f_1$. We define $f_1(r_{left}) := r_{right}$, and define
$f_1(v) := c$ for each lhs vertex $v$. Now for each such $v$, and all outgoing
edges $(v, w_1), (v, w_2), \ldots, (v, w_o)$ in $B_1$, listed in some arbitrary
order, we define $f_1([v, i]) := w_i$ for $i = 1, 2, \ldots, o$.

4. The construction of $f_2$ is analogous to that of $f_1$, but now we look at
the outgoing edges in $B_2$.

The construction is illustrated in Figure 26 for two bipartite graphs $B_1$ and
$B_2$.
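The four construction steps can be written down almost literally. The sketch below is an illustration under assumptions: a bipartite graph is given as a list of lhs vertices, a list of rhs vertices and a set of (lhs, rhs) edges, trees are dictionaries from a node to its children, the node $[v, i]$ is encoded as the Python tuple `(v, i)`, and no vertex is named "c", "r_left" or "r_right". The function name is hypothetical.

```python
def build_rule(lhs_vertices, rhs_vertices, edges):
    """Return (P_left, P, f) for a bipartite graph with edges from lhs to rhs."""
    # Outgoing edges of every lhs vertex, listed in a fixed (sorted) order.
    out = {v: sorted(w for (a, w) in edges if a == v) for v in lhs_vertices}
    p_left = {"r_left": list(lhs_vertices)}
    p = {"r_right": ["c"], "c": list(rhs_vertices)}
    f = {"r_left": "r_right"}
    for v in lhs_vertices:
        # One child [v, i] per outgoing edge of v.
        p_left[v] = [(v, i) for i in range(1, len(out[v]) + 1)]
        f[v] = "c"
        for i, w in enumerate(out[v], start=1):
            f[(v, i)] = w
    return p_left, p, f


# Invented toy bipartite graph with two lhs and two rhs vertices.
LHS = ["v1", "v2"]
RHS = ["w1", "w2"]
B_EDGES = {("v1", "w1"), ("v1", "w2"), ("v2", "w2")}
P_LEFT_TREE, P_TREE, F_MAP = build_rule(LHS, RHS, B_EDGES)
```

Calling `build_rule` once per bipartite graph (with the shared vertex numbering described above) yields the two mappings $f_1$ and $f_2$ of the reduction.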

Figure 26: Illustration of the construction of pARs from bipartite graphs:
(a) $B_1$, (b) $B_2$, (c) $P_{left}$, (d) $P$, (e) $f_1$, (f) $f_2$.

We now show the correctness of our reduction.

Lemma 7. $B_1$ and $B_2$ are isomorphic if and only if $(P_{left}, P, f_1)$ and
$(P_{left}, P, f_2)$ are isomorphic.

Proof. For the only-if direction, let $\psi$ be an isomorphism from $B_1$ to
$B_2$. We define an isomorphism $\varphi$ from $(P_{left}, P, f_1)$ to
$(P_{left}, P, f_2)$ as follows:

• $\varphi(r_{left}) = r_{left}$, $\varphi(r_{right}) = r_{right}$ and
$\varphi(c) = c$;

• $\varphi(v) = \psi(v)$, for any vertex $v$ of $B_1$;

• for any lhs vertex $v$ of outdegree $o$, and any $i = 1, 2, \ldots, o$, let $w$
be the rhs vertex such that $f_1([v, i]) = w$. Then we define
$\varphi([v, i]) := [\psi(v), j]$, where $j$ is such that
$f_2([\psi(v), j]) = \psi(w)$.

To verify that $\varphi$ is indeed an isomorphism, we only check that
$u = f_1([v, i]) \Leftrightarrow \varphi(u) = f_2(\varphi([v, i]))$. If
$u = f_1([v, i])$ then $(v, u)$ is an edge in $B_1$ and thus
$(\varphi(v), \varphi(u)) = (\psi(v), \psi(u))$ is an edge in $B_2$. Hence there
exists a $j$ such that $\varphi(u) = f_2([\varphi(v), j])$, or equivalently,
$\varphi(u) = f_2([\psi(v), j])$. By definition of $\varphi$ we have
$\varphi([v, i]) = [\psi(v), j]$ and thus $\varphi(u) = f_2(\varphi([v, i]))$ as
desired. Conversely, suppose $\varphi(u) = f_2(\varphi([v, i]))$. By definition
of $\varphi$, we have that $\varphi([v, i])$ equals $[\psi(v), j]$ for some
unique $j$, and $f_2([\psi(v), j])$ equals $\psi(f_1([v, i]))$. Hence,
$\psi(u) = \varphi(u) = \psi(f_1([v, i]))$, and thus $u = f_1([v, i])$ as
desired.

For the if-direction, let $\varphi$ be an isomorphism from
$(P_{left}, P, f_1)$ to $(P_{left}, P, f_2)$. We define an isomorphism $\psi$
from $B_1$ to $B_2$ as follows. Actually, $\psi$ is simply $\varphi$ restricted
to the vertices of $B_1$. Indeed,

$$
\begin{aligned}
(v, w) \in E_1 &\Leftrightarrow \exists i : f_1([v, i]) = w \\
&\Leftrightarrow \exists i : f_2(\varphi([v, i])) = \varphi(w) \\
&\Leftrightarrow \exists j : f_2([\varphi(v), j]) = \varphi(w) \\
&\Leftrightarrow (\psi(v), \psi(w)) \in E_2
\end{aligned}
$$

□

We can already conclude from this reduction that equivalence of pARs is
really as hard as isomorphism of bipartite directed graphs. The latter problem,
however, is well-known to be as hard as isomorphism of general directed graphs.
Indeed, any directed graph $G = (V, E)$ can be transformed into the bipartite
directed graph
$B(G) := (V \cup E, \{(v, (v, w)) \mid (v, w) \in E\} \cup \{((v, w), w) \mid (v, w) \in E\})$,
and it is easily verified that $G_1$ and $G_2$ are isomorphic if and only if
$B(G_1)$ and $B(G_2)$ are isomorphic.
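The transformation $B(G)$ is short enough to state as code. This sketch follows the set definition above literally; it assumes that vertices are hashable values that are not themselves pairs of vertices, so that vertex nodes and edge nodes cannot be confused. The function name is hypothetical.

```python
def bipartite_encoding(vertices, edges):
    """Return the node set and arc set of B(G) for a directed graph G."""
    nodes = set(vertices) | set(edges)
    # Every edge (v, w) of G becomes a node with one arc in and one arc out.
    arcs = {(v, (v, w)) for (v, w) in edges}
    arcs |= {((v, w), w) for (v, w) in edges}
    return nodes, arcs


# Invented toy graph: a directed triangle.
TRIANGLE_VERTICES = {1, 2, 3}
TRIANGLE_EDGES = {(1, 2), (2, 3), (3, 1)}
B_NODES, B_ARCS = bipartite_encoding(TRIANGLE_VERTICES, TRIANGLE_EDGES)
```

Every arc of the result joins an original vertex with an edge node, so the two node classes form the two parts of the bipartition.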

So, we can now conclude that equivalence of our pARs is really as hard as 
the general graph isomorphism problem. But as we show next, we can still 
capture an important special case in polynomial time, so that the general graph 
isomorphism heuristics only have to be applied on instances not captured by 
the special case. 

5.6.3 Polynomial case 

The special efficient case is to check whether $(P_{left}, P, f_1)$ and
$(P_{left}, P, f_2)$ are already isomorphic with $g$ the identity, i.e., whether
the structures $(P, f_1)$ and $(P, f_2)$ are already isomorphic. So, we look for
an automorphism $h$ of $P$ such that $f_2 = h \circ f_1$. This can be solved
efficiently by a reduction to node-labeled tree isomorphism. As explained in
Section ^. 4[ if we know the tree $T$ underlying $P$, then $P$ is characterized
by the pair $(\Pi, \Sigma)$, and thus $(P, f)$ is characterized by
$(\Pi, \Sigma, f)$. We can view this triple as a labeling of $T$, as follows. We
label every node $y$ of $P$ with a triple $(b_\Pi, b_\Sigma, f^{-1}(y))$, where
$b_\Pi$ is a bit that is 1 iff $y \in \Pi$; $b_\Sigma$ is a bit that is defined
likewise; and $f^{-1}(y)$ is the set of nodes of $P_{left}$ that are mapped by
$f$ to $y$. Then $(P, f_1)$ and $(P, f_2)$ are isomorphic if and only if the
corresponding node-labeled trees are isomorphic, and the latter can be
checked in linear time using canonical ordering [U [TU] .
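A simple way to see the labeling at work is a recursive canonical form: label each node with its triple, canonize the children, and sort them. This sketch is an illustration only. It is not the linear-time canonical ordering cited above (the sorting adds a logarithmic factor), and it assumes trees as dictionaries from a node to its children, $\Pi$ and $\Sigma$ as Python sets, and node names that are mutually comparable strings. The function name is hypothetical.

```python
def canonical_form(children, node, pi, sigma, f):
    """Canonical form of the node-labeled subtree of P rooted at `node`."""
    preimage = tuple(sorted(x for x, y in f.items() if y == node))
    label = (node in pi, node in sigma, preimage)
    subforms = sorted(
        canonical_form(children, child, pi, sigma, f)
        for child in children.get(node, [])
    )
    return (label, tuple(subforms))


# Pattern P of the running example: u1 with children u2 and u3, and
# neither existential nodes nor parameters.
P_CHILDREN = {"u1": ["u2", "u3"]}
G1 = {"x1": "u1", "x2": "u2", "x3": "u2", "x4": "u3"}
G1_SWAPPED = {"x1": "u1", "x2": "u3", "x3": "u3", "x4": "u2"}
G3 = {"x1": "u1", "x2": "u2", "x3": "u3", "x4": "u3"}
FORM_1 = canonical_form(P_CHILDREN, "u1", set(), set(), G1)
FORM_1_SWAPPED = canonical_form(P_CHILDREN, "u1", set(), set(), G1_SWAPPED)
FORM_3 = canonical_form(P_CHILDREN, "u1", set(), set(), G3)
```

Here `G1` and `G1_SWAPPED` differ only by the automorphism of $P$ that swaps $u_2$ and $u_3$, so their canonical forms coincide. `G1` and `G3` are equivalent only through a non-identity $g$, so this special case does not detect them and their forms differ.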

5.6.4 The Algorithm 

We are now in a position to describe how our general algorithm must be modified
to avoid equivalent association rules. There is only extra checking to be done
in loop 3 (recall Sections 5.2 and 5.3). For each new containment mapping $f$
from $P_{left}$ to $P$, we canonize the corresponding node-labeled tree and we
check if the canonical form is identical to an earlier generated canonical form;
if so, $f$ is dismissed. We can keep track of the canonical forms seen so far
efficiently using a trie data structure. If the canonical form was not yet seen,
we can either let $f$ through to loop 4, if the presence of duplicates in the
output is tolerable for the application at hand, or we can perform the colored
graph isomorphism check of Section 5.6.1 with the containment mappings
previously seen, to be absolutely sure we will not generate a duplicate.

6 Certhia: Pattern and Association Browsing 

In this section we introduce an interactive tool, called Certhia, for browsing
the frequent tree patterns and generating association rules.

As already noted in Section 4.6, the result of our tree query mining algorithm
in Section 4 is a structured database, called a pattern database, containing all
frequency tables for each tree $T$ that was investigated. This pattern database
is an ideal platform for an interactive tool for browsing the frequent queries.
Moreover, it is also an ideal platform for generating association rules as
explained in Section 5.2, since the first two loops of the association rule
algorithm are exactly our tree query mining algorithm.

In a typical scenario for Certhia, the user draws a tree shape, marks some 
nodes as existential, marks some others as parameters, instantiates some param- 
eters by constants, but possibly also leaves some parameters open. The browser 
then returns, by consulting the appropriate frequency table in the database, all 
instantiations of the free parameters that make the pattern frequent, together 
with the frequency. The user can then select one of these instantiations, set 
a minconf value, and ask the browser to return all rhs's that form a confident 
association with the selected pure tree query as lhs. In another scenario the 
user lets the browser suggest some frequent tree patterns to choose from as an 
lhs. 

Some screenshots of Certhia are given in Figure 27, Figure 28 and Figure 29.

• In Figure 27 the user draws a tree, marks some nodes as existential,
some others as parameters, instantiates some parameters with constants,
and asks the browser to return all possible instantiations of the remaining
parameters and the corresponding frequencies.

• In Figure 28 the user asks the browser to return all association rules for
a fixed lhs. The user selects a rhs in the dialog box and asks the browser
to return the instantiations and the corresponding frequencies.

• In Figure 29 the browser suggests some frequent tree patterns from which
the user can choose.

Efficiency The preprocessing step, i.e., the building up of the pattern database
with frequent tree patterns, is of course a hugely intensive task: first because
the large data graph must be accessed intensively, and secondly because the
number of frequent patterns is huge. In Section 7.2 we show that this
preprocessing step can be implemented with satisfactory performance. Also, in
scientific discovery applications it is no problem, indeed typical, if a
preprocessing step takes a few hours, as long as after that the interactive
exploration of the found results can happen very fast. And indeed we found that
the actual generation of association rules is very fast. This is also shown in
Section 7.2.

Figure 29: Screenshot of Certhia: the browser suggests frequent tree patterns.

7 Experimental Results 

In this section, we report on some experiments performed using our prototype 
implementation applied to both real-life and synthetic datasets to show that our 
approach is indeed workable. 

7.1 Real-life datasets 

We have worked with a food web, a protein interactions graph, and a citation 
graph. For each dataset we built up a pattern database using the following 
parameters: 





             #nodes    #edges     k    size
food web        154       370    25       6
proteins       2114      4480    10       5
citations      2500    350000     5       4



As we set rather generous limits on the maximum size of trees, or on the mini- 
mum frequency threshold, each run took several hours. 

The food web [34] comprises 154 species that are all directly or indirectly
dependent on the Scotch Broom (a kind of shrub). One of the patterns that
was mined with frequency 176 is the following:

[Figure: tree pattern with the nodes $x_1$, $\exists$ and $x_2$, and two nodes
labeled by the constant 20.]

This is really a rather arbitrary example, just to give an idea of the kind of 
complex patterns that can be mined. Note also that, thanks to the constant 20 
appearing twice, this is really a non-tree shaped pattern: we could equally well 
draw both arrows to a single node labeled 20. 

While we were thus browsing through the results, we quickly noticed that 
the constant 20 actually occurs quite predominantly, in many different frequent 
patterns. This constant denotes the species Orthotylus adenocarpi, an omnivo- 
rous plant bug. To confirm our hypothesis that this species plays a central role 
in the food web, we asked for all association rules with the following left-hand 
side: 

  (x1, x2)          (x1, x2)

     x1                x1
     |                 |
     ∃                 ∃
     |                 |
     ∃       =>        20
     |                 |
     ∃                 ∃
     |                 |
     x2                x2

Indeed, the rule shown above turned up with 89% confidence! For 89% of all 
pairs of species that are linked by a path of length four, Orthotylus adenocarpi 
is involved in between. 

Two other rules we discovered are: 

[Figure: two association rules with lhs head $(x_1, x_2, x_3, x_4, x_5)$. In the
first rule the rhs head is $(0, x_2, x_3, x_4, x_5)$; in the second rule the rhs
head is $(x_1, x_2, x_3, x_4, x_5)$.]

Since 45% + 55% = 100%, these rules together say that each path of length 5
either starts in 0, or one beneath 0. This tells us that the depth of the food
web equals 6. Constant 0 turns out to denote the Scotch Broom itself, which is
the root of the food web.

Another rule we mined, just to give a rather arbitrary example of the kind 
of rules we find with our algorithm, is the following: 

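[Figure: association rule with lhs head $(x_1, x_2, x_3, x_4, x_5)$ and rhs head
$(x_1, x_2, x_4, x_2, x_5)$; the drawn pattern contains the nodes $x_1$, $x_2$,
$x_4$, $x_5$ and the constant 101.]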

The protein interaction graph [25] comprises molecular interactions (sym- 
metric) among 1870 proteins occurring in the yeast Saccharomyces cerevisiae. In 
such interaction networks, typically a small number of highly connected nodes 
occur. Indeed, we discovered the following association rule with 10% confidence, 
indicating that protein #224 is highly connected: 



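[Figure: a tree with root $x_1$ whose children $x_2$ and $x_4$ have the children
$x_3$ and $x_5$, respectively, and an association rule, labeled 11% in the
drawing, from the path $x_1$, $\exists$, $x_2$ to the path $x_1$, 224, $x_2$.]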

We also found the following rule: 

  (x1, x2)            (x1, x2)

     x1                  x1
     |       90%         |
     x2       =>         x2
     |                  /  \
    746              746    376

This rule expresses that almost all interactions that link to protein 746 also link 
to protein 376, which unveils a close relationship between these two proteins. 

The citation graph comes from the KDD cup 2003, and contains around 
2500 papers about high-energy physics taken from arXiv.org, with around 350 000 
cross-references. One of the discovered patterns is the following, with frequency 
1655, showing two papers that are frequently cited together (by 6% of all pa- 
pers). 

           x1
          /  \
   9711200    9802150

One of the discovered rules is the following: 

[Figure: association rule with head $(x_1, x_2)$ on both sides and confidence
15%. The lhs is a tree with root $x_1$ whose children are $x_2$ and an
existential node that itself has an existential child; in the rhs, $x_1$ has the
child $x_2$, which has the child 9503124.]

This rule shows that paper 9503124 is an important paper. In 15% of all
"non-trivial" citations (meaning that the citing paper cites at least one paper
that also cites a paper), the cited paper cites 9503124.

Figure 30: Performance on Web graphs.

7.2 Performance 

While our prototype implementations are not tuned for performance, we still 
conducted some preliminary performance measurements, with encouraging re- 
sults. The experiments were performed on a Pentium IV (2.8GHz) architecture 
with 1GB of internal memory, running under Linux 2.6. The program was writ- 
ten in C++ with embedded SQL, with DB2 UDB v8.2 as the relational database 
system. 

We have used two types of synthetic datasets. 

Random Web graphs Naturally occurring graphs (as found in biology, sociology,
or the WWW) have a number of typical characteristics, such as sparseness
and a skewed degree distribution [35]. Various random graph models have been
proposed in this respect, of which we have used the "copy model" for Web
graphs [27] [6]. We use degree 5 and probability $\alpha = 10\%$ to link to a
random node (thus 90% to copy a link).
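For readers unfamiliar with the copy model, a minimal generator might look as follows. This is a sketch of one common variant and not necessarily the exact procedure used for the experiments: the seed graph, the choice of a uniformly random prototype node, the use of a fixed random seed, and the function name are all assumptions made for illustration.

```python
import random


def copy_model_graph(n, degree=5, alpha=0.10, seed=0):
    """Return a dict mapping each node to its list of `degree` out-links."""
    if n < degree:
        raise ValueError("this sketch needs at least `degree` nodes")
    rng = random.Random(seed)
    # Assumed seed graph: the first `degree` nodes link randomly among themselves.
    out = {v: [rng.randrange(degree) for _ in range(degree)] for v in range(degree)}
    for v in range(degree, n):
        prototype = rng.randrange(v)  # an existing node to copy links from
        links = []
        for i in range(degree):
            if rng.random() < alpha:
                links.append(rng.randrange(v))       # link to a random node
            else:
                links.append(out[prototype][i])      # copy the i-th link
        out[v] = links
    return out


WEB_GRAPH = copy_model_graph(50)
```

Copying links from existing nodes is what produces the skewed in-degree distribution that distinguishes these graphs from uniform random graphs.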

On these graphs, we have measured the total running time of the tree query
mining algorithm as a function of the size (number of edges) of the graph, where
we mine up to tree size 5, with varying minimum frequency thresholds of 4, 10,
and 25. The results, depicted in Figure 30, show that the performance of these
runs is quite adequate.







Figure 31: Performance in terms of number of discovered patterns. 



Uniform random graphs We have also experimented with the well-known
Erdős-Rényi random graphs, where one specifies a number $n$ of nodes and gives
each of the possible $n^2$ edges a uniform probability (we used 10%) of actually
belonging to the graph. In contrast to random Web graphs, these graphs are
quite dense and uniform, and they serve well as a worst-case scenario to measure
the performance of the tree query mining algorithm as a function of the number
of discovered patterns, which will be huge.
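Such a graph is straightforward to generate. The sketch below takes the description literally, treating all $n^2$ ordered pairs (including self-loops) as possible directed edges; the fixed random seed and the function name are assumptions added for reproducibility and illustration.

```python
import random


def uniform_random_graph(n, p=0.10, seed=0):
    """Return the directed edges kept among all n * n ordered node pairs."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(n) if rng.random() < p]


UNIFORM_EDGES = uniform_random_graph(30)
```

The expected number of edges is $p \cdot n^2$, so the graph becomes dense quickly as $n$ grows.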

We have run on graphs with 47, 264, and 997 edges, with minimum frequency
thresholds of 10 and 25. The results, depicted in Figure 31, show, first, that
huge numbers of patterns are mined within a reasonable time, and second, that
the overhead per discovered pattern is constant (all six lines have the same
slope).

On these uniform random graphs we also conducted some experiments to
check the performance of the association rule mining algorithm. We found the
actual generation of association rules (i.e., loops 3 and 4, assuming that a
pattern database is already built up) to be very fast. For instance, Figure 32
shows the performance of generating association rules for two different
(absolute) values of minconf, against a frequency table database built up for a
random graph with 33 nodes and 113 edges, an absolute minsup of 25, and all
trees up to size 7. We see that associations are generated with constant
overhead, i.e., in linear-output time. The coefficient is larger for the larger
minconf, because in this experiment we have counted instantiated rhs's, and per
rhs query fewer instantiations satisfy the confidence threshold for larger such
thresholds. Had we simply counted rhs's regardless of the number of confident
instantiations, the two lines would have had the same slope.
Figure 32: Performance in terms of number of discovered rules.



Performance issues One major performance issue that we have not addressed in
the present study is that some of the SQL queries that are performed for
pattern generation take a very long time (on the order of hours) to be answered
by the database system. This happens in those cases where the data graph is
large (5000 edges or more) with many cycles, and the candidate patterns are
large (6 nodes or more). Certainly, some SQL queries can be hand-optimized (or
replaced by a combination of simpler queries) to alleviate these performance
problems, but we leave this issue to future research.
Appendix



Notation                              Interpretation

$U$                                   set of data constants
$T$                                   ordered rooted tree
$G$                                   data graph
$P$                                   parameterized tree pattern
$\Pi$                                 set of existential nodes
$\Delta$                              set of distinguished nodes
$\Sigma$                              set of parameters
$\exists$                             existential node
$\sigma$                              parameter
$x$                                   distinguished node
$\alpha$                              parameter assignment
$P^\alpha$, $(P, \alpha)$             instantiated tree pattern
$P^\alpha(G)$                         $\{\mu|_\Delta : \mu$ is a matching of $P^\alpha$ in $G\}$
minsup                                the frequency threshold
$Q = (H, P)$                          parameterized tree query with $H$ the head and $P$ the body
$Q^\alpha$                            instantiated tree query
$Q^\alpha(G)$                         answer set of the instantiated tree query $Q^\alpha$ in $G$
$\rho$                                parameter correspondence
$Q_2 \subseteq_\rho Q_1$              $Q_2$ is $\rho$-contained in $Q_1$
$\mathrm{freeze}_\rho(P)$             the freezing of a tree pattern $P$
pAR                                   parameterized association rule
iAR                                   instantiated association rule
$Q_1 \Rightarrow_\rho Q_2$            pAR from $Q_1$ to $Q_2$
$(Q_1 \Rightarrow_\rho Q_2, \alpha)$  iAR from $Q_1$ to $Q_2$
minconf                               the confidence threshold
$Freq(P^\alpha)$                      the frequency of $P^\alpha$ in $G$
$(\Pi, \Sigma)$                       a parameterized tree pattern $P$ based on a fixed tree $T$
$(\Pi, \Sigma, \alpha)$               an instantiated tree pattern $P$ based on a fixed tree $T$
$CanTab_{\Pi,\Sigma}$                 $\{\alpha \mid P^\alpha$ is a candidate instantiated tree pattern$\}$
$FreqTab_{\Pi,\Sigma}$                $\{\alpha \mid P^\alpha$ is a frequent instantiated tree pattern$\}$
$\delta$                              answer set correspondence
$P_1 \equiv^{\delta}_{\rho} P_2$      $P_1$ is $(\delta, \rho)$-equivalent with $P_2$
$P_2^{\alpha_2}(G) \circ \delta$      $\{f \circ \delta : f \in P_2^{\alpha_2}(G)\}$
$P_1 \cong P_2$                       $P_1$ and $P_2$ are isomorphic